Abstract

In randomized control trials (RCTs) in the education field, the complier average causal effect (CACE)
parameter is often of policy interest, because it pertains to intervention effects for students who receive a meaningful
dose of treatment services. This report uses a causal inference and instrumental variables framework to examine
the identification and estimation of the CACE parameter for two-level clustered RCTs. The report also provides
simple asymptotic variance formulas for CACE impact estimators measured in nominal and standard deviation
units. In the empirical work, data from ten large RCTs are used to compare significance findings using correct
CACE variance estimators and commonly-used approximations that ignore the estimation error in service receipt
rates and outcome standard deviations. Our key finding is that the variance corrections have very little effect on
the standard errors of standardized CACE impact estimators. Across the examined outcomes, the correction
terms typically raise the standard errors by less than 1 percent, and change p-values at the fourth or higher
decimal place.

Chapter 1: Introduction



Randomized control trials (RCTs) in the education field typically examine the intention-to-treat (ITT)
parameter, which is estimated by comparing the mean outcomes of treatment group members (who are 
offered intervention services) to those of control group members (who are not). RCTs also sometimes 
examine two policy-relevant variants of the ITT parameter. The first variant is the complier average 
causal effect (CACE) parameter, defined as the average impact of intervention services on those who
comply with their treatment assignments (Bloom 1984; Angrist et al. 1996). Estimators for this parameter 
are obtained by adjusting the ITT impact estimators for those in the treatment group who do not receive 
intervention services and for crossovers in the control group who erroneously receive intervention 
services. Second, it is becoming increasingly popular to standardize ITT and CACE impact estimates into 
effect size (standard deviation) units. This metric is useful for comparing findings across outcomes that 
are measured on different scales, for interpreting impacts that are difficult to understand in nominal units, 
and for comparing study findings to those from previous evaluations (Cohen 1988; Lipsey and Wilson 
1995; Hedges 1981 and 2007). 

This report addresses two main issues. First, it systematically examines the identification of the CACE 
parameter under clustered RCT designs that are typically used in the education field, where units (such as 
schools or classrooms) rather than students are randomly assigned to a treatment or control condition. 
Using a causal inference and instrumental variables (IV) framework, we extend the identification 
conditions in Angrist et al. (1996) to two-level clustered designs, where treatment compliance decisions
can be made by both school staff and students. Our emphasis differs from Jo et al. (2008), who focus on
parametric and path modeling of treatment noncompliance under clustered designs using multilevel 
mixture models and maximum likelihood methods. 

The second purpose of the report is to theoretically and empirically examine variance estimation under 
clustered designs for two types of IV estimators: (1) CACE estimators in nominal units, and (2) ITT and 
CACE estimators in effect size units — hereafter referred to as standardized estimators. These estimators 
are ratio estimators, whose variances must account for estimation errors in their numerators and 
denominators. In practice, however, analysts often ignore the estimation error in the denominator terms, 
which are assumed to be known. Thus, in study reports, the same t-statistics and p-values are sometimes 
reported for all estimators. 

A potential problem with this approach, however, is that it could lead to significance findings that are 
biased if the variance correction terms for the denominators matter. Accordingly, we present simple 
asymptotic variance estimation formulas for commonly-used ratio estimators by combining variance
results in Hedges (2007) for standardized ITT estimators with those in Little et al. (2006) and Heckman et 
al. (1994) for CACE estimators. We then use data from ten large-scale RCTs to compare significance 
findings using the correct variance formulas with those that are typically used in practice, an empirical 
issue that has not been systematically addressed in the literature. The empirical results can be used to help 
guide future decisions as to whether the correct, but more complex variance formulas are warranted for 
RCTs in the education area to obtain rigorous significance findings for the full range of impact estimators. 

The remainder of this report is in six chapters. Chapter 2 discusses the causal inference framework 
underlying the ITT estimator for two-level clustered designs, which forms the foundation for the CACE 
analysis. Chapter 3 discusses impact and variance estimation of the ITT parameter, and Chapter 4 
discusses identification and estimation of the CACE parameter. Chapter 5 discusses estimation of the 
impact parameters in effect size units. Chapter 6 discusses empirical findings, and the final chapter 
presents a summary and conclusions. 
Chapter 2: The Theoretical Framework Underlying the ITT Parameter



We consider two-level clustered designs where students are nested within units (such as schools, 
classrooms, or districts) that are randomly assigned to a single treatment or control group — the most 
common designs used in large-scale RCTs in the education field. The results that are presented for two- 
level designs, however, can be collapsed to obtain results for nonclustered designs where students are the 
unit of random assignment. This is because nonclustered designs are a special case of clustered designs 
where every cluster has one student and there is no within-cluster variance. 

We consider a “superpopulation” version of the Neyman-Rubin causal inference model (see Rubin 1974; 
Imbens and Rubin 2007; Schochet 2008). It is assumed that the sample contains $n$ units (groups), with $np$
treatment units and $n(1-p)$ control units, where $p$ is the sampling rate to the treatment group ($0 < p < 1$). Let
$W_{Ti}$ be the “potential” unit-level outcome for unit $i$ when assigned to the treatment condition and $W_{Ci}$ be
the potential outcome for unit $i$ in the control condition. These potential outcomes are assumed to be
random draws from potential treatment and control outcome distributions in the study population with
means $\mu_T$ and $\mu_C$, respectively, and common variance $\sigma_u^2$. It is assumed that the potential outcomes for
each unit are independent of the treatment status of other units.

Suppose next that $m_i$ students are sampled from the student superpopulation within study unit $i$. Let $Y_{Tij}$
and $Y_{Cij}$ be student-level potential outcomes (conditional on unit-level potential outcomes) that are
random draws from potential outcome distributions with means $W_{Ti}$ and $W_{Ci}$, respectively, and common
variance $\sigma_e^2 > 0$.

Under this causal inference model, the difference between the two potential outcomes, $(W_{Ti} - W_{Ci})$, is the
unit-level treatment effect for unit $i$, and the ITT (or average treatment effect) parameter is
$E(W_{Ti} - W_{Ci}) = \mu_T - \mu_C$. The unit-level treatment effects, and hence, the ITT parameter, cannot be
calculated directly because for each unit and student, the potential outcome is observed in either the
treatment or control condition, but not in both. Formally, if $T_i$ is a treatment status indicator variable that
equals 1 for treatments and 0 for controls, then the observed outcome for a unit, $w_i$, can be expressed as
follows:

(1) $w_i = T_i W_{Ti} + (1 - T_i) W_{Ci}$.

Similarly, the observed outcome for a student, $y_{ij}$, is:

(2) $y_{ij} = T_i Y_{Tij} + (1 - T_i) Y_{Cij}$.

The simple equations in (1) and (2) form the basis for the estimation models that are considered in this 
report. 

The terms in (1) can be rearranged to create the following regression model: 

(3) $y_{ij} = \alpha_0 + \alpha_{ITT} T_i + (u_i + e_{ij})$, where

1. $\alpha_0 = \mu_C$ and $\alpha_{ITT} = \mu_T - \mu_C$ (the ITT parameter) are coefficients to be estimated;

2. $u_i = T_i (W_{Ti} - \mu_T) + (1 - T_i)(W_{Ci} - \mu_C)$ is a unit-level error term with mean zero and
between-unit variance $\sigma_u^2$ that is uncorrelated with $T_i$; and

3. $e_{ij} = T_i (Y_{Tij} - W_{Ti}) + (1 - T_i)(Y_{Cij} - W_{Ci})$ is a student-level error term with mean zero and
within-unit variance $\sigma_e^2$ that is uncorrelated with $u_i$ and $T_i$.

Importantly, (3) can also be derived using the following two-level hierarchical linear model (HLM) (Bryk 
and Raudenbush 1992): 

Level 1: $y_{ij} = w_i + e_{ij}$

Level 2: $w_i = \alpha_0 + \alpha_{ITT} T_i + u_i$,

where Level 1 corresponds to students and Level 2 to units. Inserting the Level 2 equation into the Level 
1 equation yields (3). Thus, the HLM approach is consistent with the causal inference theory presented 
above. 
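The data-generating process behind (1) to (3) can be sketched as a small simulation. This is only an illustration: the function name, the normal distributions, and every default parameter value are assumptions made for the sketch and do not come from the report, which assumes only means and common variances.

```python
import random


def simulate_clustered_rct(n=40, m=25, p=0.5, mu_t=0.3, mu_c=0.0,
                           sigma_u=0.5, sigma_e=1.0, seed=0):
    """Draw one sample from the two-level potential outcomes model.

    Returns a list of (T_i, [y_i1, ..., y_im]) pairs, one per unit.
    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_treat = round(n * p)
    assignment = [1] * n_treat + [0] * (n - n_treat)
    rng.shuffle(assignment)  # random assignment of units

    data = []
    for t_i in assignment:
        # Unit-level potential outcomes W_Ti and W_Ci.
        w_t = rng.gauss(mu_t, sigma_u)
        w_c = rng.gauss(mu_c, sigma_u)
        outcomes = []
        for _ in range(m):
            # Student-level potential outcomes Y_Tij and Y_Cij.
            y_t = rng.gauss(w_t, sigma_e)
            y_c = rng.gauss(w_c, sigma_e)
            # Equation (2): only one potential outcome is observed.
            outcomes.append(t_i * y_t + (1 - t_i) * y_c)
        data.append((t_i, outcomes))
    return data


sample = simulate_clustered_rct()
```

Each unit carries both potential outcomes internally, but the returned data contain only the observed one, which is why unit-level treatment effects cannot be computed directly from the sample.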

Finally, baseline covariates can be included in (3) as “irrelevant” variables to improve the precision of the 
impact estimates, which yields the following estimation model: 

(4) $y_{ij} = \alpha_0 + \alpha_{ITT} T_i + (X_{ij} - \bar{X}_i)'\gamma + Z_i'\delta + (u_i^* + e_{ij}^*)$,

where $X_{ij}$ is a vector of student-level baseline covariates that is centered around the unit-level covariate
mean $\bar{X}_i$; $\gamma$ is a parameter vector that is associated with $X_{ij}$; $Z_i$ is a vector of unit-level baseline
covariates (that could include $\bar{X}_i$ and stratum indicators) with associated parameter vector $\delta$; and $u_i^*$
and $e_{ij}^*$ are error terms that are now conditional on the covariates. We center the $X_{ij}$ covariates around the
unit-level means so that we can separately identify the effects of covariates on the within-unit and
between-unit variance components.
Chapter 3: ITT Impact and Variance Estimators



In this chapter, we use the models in (3) and (4) to discuss ITT estimators in nominal units, because they
form the foundation for the CACE and standardized estimators. We focus on commonly used differences-in-means
and analysis of covariance (ANCOVA) estimators, which are used for the empirical analysis.

We make the simplifying assumption that $m_i = m$ for all units (that is, equal cluster sizes). Cluster sizes
are often similar for RCTs in the education area (and for the RCTs examined in our empirical work), and
variance formulas are much more complex with unequal cluster sizes. Furthermore, the formulas
presented in this chapter apply approximately for unequal unit sizes that do not vary substantially across
units if $m$ is replaced in the formulas by the average unit size $\bar{m}$ (Kish 1965) or, preferably, by
$[n / \sum_{i=1}^{n} (1 / m_i)]$ (Hedges 2007).
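The two replacement values for $m$ can be computed as below. This is a minimal sketch; the function names and the example cluster sizes are hypothetical, and the second quantity is read here as the harmonic mean of the unit sizes.

```python
def average_unit_size(sizes):
    """Arithmetic mean of the unit sizes m_i."""
    return sum(sizes) / len(sizes)


def harmonic_unit_size(sizes):
    """n divided by the sum of 1/m_i, that is, the harmonic mean of the m_i."""
    return len(sizes) / sum(1.0 / m_i for m_i in sizes)


sizes = [10, 20, 40]  # hypothetical cluster sizes
m_bar = average_unit_size(sizes)
m_harmonic = harmonic_unit_size(sizes)
```

The harmonic mean never exceeds the arithmetic mean, so it gives more weight to small clusters; the two coincide when all clusters have the same size.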



The Simple Differences-In-Means Estimator 

The simple differences-in-means ITT estimator can be obtained by applying standard regression 
methods to (3). The resulting estimator is as follows: 

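(5) $\hat{\alpha}_{ITT,1} = \bar{y}_T - \bar{y}_C$,

where $\bar{y}_T = \frac{1}{np} \sum_{i: T_i = 1} \bar{y}_i$, $\bar{y}_C = \frac{1}{n(1-p)} \sum_{i: T_i = 0} \bar{y}_i$, and
$\bar{y}_i = \frac{1}{m} \sum_{j=1}^{m} y_{ij}$. This estimator is the average
difference between cluster means across the treatment and control groups.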



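Schochet (2008) shows that $\hat{\alpha}_{ITT,1}$ is asymptotically normally distributed with mean $\alpha_{ITT}$ and the
following asymptotic variance:

(6) $AsyVar(\hat{\alpha}_{ITT,1}) = \frac{1}{p(1-p)} \left( \frac{\sigma_u^2}{n} + \frac{\sigma_e^2}{nm} \right)$.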



The within-unit (second) variance term in (6) is the conventional variance expression for an impact 
estimator in a nonclustered design where random assignment is conducted within units. Design effects in 
a clustered design arise because of the first between-unit variance term, which represents the extent to 
which mean outcomes vary across units (Murray 1998; Donner and Klar 2000). 

An asymptotically unbiased estimator for the within-unit variance $\sigma_e^2$ is as follows (Cochran 1963;
Hedges 2007):

(7) $\hat{\sigma}_e^2 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2}{n(m-1)}$.



Similarly, an asymptotically unbiased estimator for the between-unit variance $\sigma_u^2$ is:

(8) $\hat{\sigma}_u^2 = S_B^2 - \frac{\hat{\sigma}_e^2}{m}$, where

(9) $S_B^2 = \frac{\sum_{i: T_i = 1} (\bar{y}_i - \bar{y}_T)^2 + \sum_{i: T_i = 0} (\bar{y}_i - \bar{y}_C)^2}{n - 2}$.

Note that equation (9) can also be expressed in terms of regression residual sums of squares:

(10) $S_B^2 = \frac{\sum_{i=1}^{n} (\bar{y}_i - \hat{\bar{y}}_i)^2}{n - 2}$,

where $\hat{\bar{y}}_i$ is the predicted value for unit $i$ from the between-unit regression of $\bar{y}_i$ on $T_i$ and an intercept.
Inserting (7) and (8) into (6) yields the following variance estimator for $\hat{\alpha}_{ITT,1}$:

(11) $\widehat{AsyVar}(\hat{\alpha}_{ITT,1}) = \frac{S_B^2}{np(1-p)}$.

This estimator also applies to nonclustered designs where units are defined as students. 
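The differences-in-means estimator and its variance estimator can be sketched as below, assuming equal cluster sizes as in the text. The function name, the input layout (a list of `(T_i, outcomes)` pairs), and the toy data are assumptions made for illustration only.

```python
def itt_difference_in_means(data):
    """Differences-in-means ITT estimate and its clustered variance.

    data: list of (T_i, [y_i1, ..., y_im]) pairs with a common cluster size m.
    """
    n = len(data)
    m = len(data[0][1])
    unit_means = [sum(ys) / m for _, ys in data]
    treat = [yb for (t, _), yb in zip(data, unit_means) if t == 1]
    ctrl = [yb for (t, _), yb in zip(data, unit_means) if t == 0]
    ybar_t = sum(treat) / len(treat)
    ybar_c = sum(ctrl) / len(ctrl)
    p = len(treat) / n

    # Equation (7): pooled within-unit variance.
    ss_within = sum((y - yb) ** 2
                    for (_, ys), yb in zip(data, unit_means) for y in ys)
    sigma_e2 = ss_within / (n * (m - 1))

    # Equation (9): variation of cluster means around their group means.
    ss_between = (sum((yb - ybar_t) ** 2 for yb in treat)
                  + sum((yb - ybar_c) ** 2 for yb in ctrl))
    s_b2 = ss_between / (n - 2)

    return {
        "itt": ybar_t - ybar_c,               # equation (5)
        "sigma_e2": sigma_e2,                 # equation (7)
        "sigma_u2": s_b2 - sigma_e2 / m,      # equation (8)
        "var_itt": s_b2 / (n * p * (1 - p)),  # equation (11)
    }


toy = [(1, [3, 5]), (1, [5, 7]), (0, [1, 3]), (0, [3, 5])]
result = itt_difference_in_means(toy)
```

For the toy data, plugging the estimated variance components back into (6) reproduces the value from (11), which is the algebraic point of inserting (7) and (8) into (6). In small samples the estimate from (8) can be negative; the sketch does not truncate it.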



The Analysis of Covariance (ANCOVA) Estimator 

The ANCOVA estimator $\hat{\alpha}_{ITT,2}$ can be obtained by applying regression methods to (4) where baseline
covariates (such as pretests) are included in the analytic models, primarily to improve the precision of the
impact estimates. Schochet (2008) shows that $\hat{\alpha}_{ITT,2}$ is asymptotically normally distributed with mean
$\alpha_{ITT}$ and the following asymptotic variance:

(12) $AsyVar(\hat{\alpha}_{ITT,2}) = \frac{1}{p(1-p)} \left( \frac{\sigma_{u^*}^2}{n} + \frac{\sigma_{e^*}^2}{nm} \right)$.

In this expression, $\sigma_{u^*}^2$ and $\sigma_{e^*}^2$ are between- and within-unit variances, respectively, that are
conditional on the covariates, and reduce $\sigma_u^2$ and $\sigma_e^2$ depending on the size of the outcome-covariate
correlations in the joint superpopulation distributions (these are $R^2$ adjustments).

Using methods that are parallel to the simple differences-in-means estimator presented above, a consistent
variance estimator for $\hat{\alpha}_{ITT,2}$ in (12) is as follows:

(13) $\widehat{AsyVar}(\hat{\alpha}_{ITT,2}) = \frac{S_{B^*}^2}{np(1-p)}$,

where $S_{B^*}^2$ is obtained using (10) with the following changes: (1) $\hat{\bar{y}}_i$ is now the predicted value for unit $i$
from the between-unit regression of $\bar{y}_i$ on $Q_i = [1 \; T_i \; Z_i]$; and (2) $(n - 2)$ is replaced by $(n - k)$, where
$k$ is the rank of the matrix $Q$ whose rows contain the $Q_i$s. In practice, $T_i$ and $Z_i$ may be weakly
correlated due to random sampling and missing data. Thus, (13) can be refined as follows:

(14) $\widehat{AsyVar}(\hat{\alpha}_{ITT,2}) = [(Q'Q)^{-1}]_{22} \, S_{B^*}^2$.
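As a check on how (14) relates to (13), the sketch below assumes that the refinement scales the residual variance by the diagonal element of $(Q'Q)^{-1}$ that corresponds to $T_i$ (that reading of the formula is an assumption). With no covariates, so that $Q_i = [1 \; T_i]$, this element equals $1 / [np(1-p)]$ exactly. The function name and the example assignment vector are hypothetical.

```python
def qtq_inverse_22(treat):
    """(2,2) element of (Q'Q)^{-1} when Q_i = [1, T_i] (no covariates).

    treat: list of 0/1 treatment indicators, one per unit.
    """
    n = len(treat)
    n_t = sum(treat)
    # Q'Q = [[n, n_t], [n_t, n_t]] because T_i is 0/1, so T_i^2 = T_i.
    det = n * n_t - n_t * n_t
    # Inverse of a 2x2 matrix [[a, b], [c, d]] has (2,2) element a / det.
    return n / det


treat = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical assignment, p = 0.3
n, p = len(treat), sum(treat) / len(treat)
scale_14 = qtq_inverse_22(treat)
scale_13 = 1.0 / (n * p * (1 - p))
```

When covariates $Z_i$ are included and are correlated with $T_i$ in the sample, the two scale factors differ, which is the situation the refinement is meant to handle; computing that case requires inverting the full $Q'Q$ matrix.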

Finally, in our empirical work, we also used Stata to estimate more efficient generalized least squares
models that allowed for unequal cluster sample sizes. Specifically, we used generalized estimating
equation (GEE) methods with the sandwich variance estimator (Liang and Zeger 1986), and full and
restricted maximum likelihood approaches to general linear mixed models (Littell et al. 1996; Bryk and
Raudenbush 1992). The empirical results using these methods are very similar to those that are presented 
in this report, and thus, are not reported. 
Chapter 4: The CACE Parameter

The ITT estimator provides information on treatment effects for those in the study population who were 
offered intervention services. The treatment group sample used to estimate this parameter, however, might 
include not only students who received services but also those who did not. Similarly, the control group 
sample may include crossovers who received embargoed intervention services for advertent or inadvertent 
reasons. In these cases, the ITT estimates may understate intervention effects for those who were eligible 
for and actually received services (assuming that the intervention improves outcomes). Thus, it is often of 
policy interest to estimate the CACE parameter that pertains to those who complied with their treatment 
assignments. 

It is important to recognize that if treatment group noncompliers existed in the evaluation sites, they are 
likely to exist if the intervention were implemented more broadly. Thus, the ITT parameter pertains to 
real-world treatment effects. The CACE parameter, however, is important for understanding the “pure” 
effects of the intervention for those who received meaningful intervention services, especially for efficacy 
studies that aim to assess whether the studied intervention can work. Decision makers may also be 
interested in the CACE parameter if they believe that intervention implementation could be improved in 
their sites. Furthermore, the CACE parameter can be critical for drawing policy lessons from ITT effects; 
for instance, the CACE parameter can distinguish whether a small ITT effect is due to low rates of 
compliance or due to small treatment effects among compliers, with each scenario implying different
strategies for improving intervention effects. 



Sources of Noncompliance 

Under clustered RCT designs in the education area, the extent to which students receive intervention 
services could depend on compliance decisions made by both school staff (such as superintendents, 
principals, and teachers) and students. The interplay between these sources will depend on the particular 
intervention and study design (Jo et al. 2008). Furthermore, the extent of compliance will depend on the 
approach for defining service dosage, which is a topic that is beyond the scope of this report. For context, 
however, in what follows, we briefly discuss general sources of noncompliance at the school and student 
levels. 



Noncompliance by School Staff 

School staff in treatment units may not offer intervention services for several reasons. First, school 
principals or district superintendents may change their minds about implementing the intervention, due to 
changes in school priorities or for other reasons. Second, even if schools agree to participate, some 
teachers may not, perhaps due to initial problems implementing the intervention or because they prefer 
their status quo teaching methods or curricula. In addition, noncompliance could occur if school personnel 
are not adequately trained in intervention procedures. Similarly, crossovers could occur if staff in control 
schools decide to offer the intervention (or a very similar one), perhaps because of a strong belief that the 
intervention is effective (from discussions with evaluators) and a strong desire to implement it 
immediately rather than after the embargo period. 



Noncompliance by Students 

Students may also play a role in noncompliance for several reasons. First, a student may not receive 
meaningful intervention services due to a lack of school attendance. This could occur, for example, if the
student is suspended, is chronically absent, or, if relevant, decides not to attend a voluntary program (for
example, an after-school program).

Second, student mobility in and out of the study schools could lead to a low dosage of service receipt. In 
some designs, follow-up data are collected only for students who are present in the study schools at 
baseline (to ensure that the treatment and control group student samples will have similar baseline 
characteristics). In these designs, noncompliers may include those who left the treatment schools soon 
after the start of the school year. A more common “place-based” design, however, is when follow-up
data are collected for all students in the target grades who are in the study schools at data collection,
including those who entered the schools after baseline. In these designs, noncompliers could include
students who entered the study schools shortly before follow-up data collection.[1] Under either design,
crossovers could occur due to student mobility if control students in the follow-up sample transfer to 
treatment schools or classrooms. 



Identification of the CACE Parameter

This section discusses the identification of the CACE parameter under two scenarios. First, to fix ideas,
we assume that compliance is determined solely by school staff, and that all students who are offered 
services receive them. Second, we consider the more general case where compliance is determined by 
both schools and students, in which case some students may not receive services even if their schools 
offer them. For both scenarios, treatment status ($T_i$) is determined at random assignment and is fixed
thereafter; $T_i$ values are not affected by compliance decisions. We assume also that if the RCT uses a
“place-based” design as discussed above, there are no treatment effects on student mobility. Finally,
because the literature has conceptualized compliance decisions as dichotomous (Angrist et al. 1996), we
model the offer and receipt of services as binary decisions.

In what follows, we introduce some new notation. Let $R_i = R_i(T_i)$ denote an indicator variable that
equals 1 if unit $i$ would offer intervention services if assigned to a given treatment condition ($T_i = 0$ or
$T_i = 1$), and let $W_i(T_i, R_i)$ denote the unit’s potential outcome for a given value of $(T_i, R_i)$; there are four
such potential outcomes. Similarly, let $D_{ij} = D_{ij}(T_i, R_i)$ denote an indicator variable that equals 1 if the
student receives intervention services from any study school, given one of the four possible combinations
of $(T_i, R_i)$. Finally, let $Y_{ij}(T_i, R_i, D_{ij})$ denote the student’s potential outcome, given one of the possible
combinations of $(T_i, R_i, D_{ij})$.



The CACE Parameter When Compliance Decisions Are Made by Units Only

To identify the CACE parameter when treatment compliance decisions are made by units only, we
classify units into four mutually exclusive compliance categories: compliers, never-takers, always-takers,
and defiers (Angrist et al. 1996). Compliers (CL) are those who would offer intervention services only if
they were assigned to the treatment group [$R_i(1) = 1$ and $R_i(0) = 0$]. Never-takers (N) are those who
would never offer treatment services [$R_i(1) = 0$ and $R_i(0) = 0$], and always-takers (A) are those who would
always offer treatment services [$R_i(1) = 1$ and $R_i(0) = 1$]. Finally, defiers (D) are those who would offer
treatment services only in the control condition [$R_i(1) = 0$ and $R_i(0) = 1$]. Outcome data are assumed to
be available for all sample members. Note that this scenario applies also to nonclustered designs where
units are students.

[1] The ITT estimates under this design pertain to the combined effects of the intervention on student mobility
and student outcomes, because of potential intervention effects on the fraction and types of students who enter and
leave the study schools.

The ITT parameter for the pooled sample can be expressed as a weighted average of the ITT 
parameters for each of the four unobserved compliance groups: 

(15) $\alpha_{ITT} = p_{CL} \alpha_{ITT\_CL} + p_N \alpha_{ITT\_N} + p_A \alpha_{ITT\_A} + p_D \alpha_{ITT\_D}$,

where $p_g$ is the fraction of the study population in compliance group $g$ ($\sum_g p_g = 1$), and $\alpha_{ITT\_g}$ is the
associated ITT impact parameter (as defined earlier).

Following Angrist et al. (1996), the $\alpha_{ITT\_CL}$ parameter in (15) can then be identified under three key
assumptions (U1-U3):

U1. The Unit-Level Stable Unit Treatment Value Assumption (SUTVA): Unit-level potential
compliance decisions [$R_i(T_i)$] and outcomes [$W_i(T_i, R_i)$] are unrelated to the treatment
status of other units. This allows us to express $R_i(T_i)$ and $W_i(T_i, R_i)$ in terms of $T_i$ rather
than the vector of treatment statuses of all units. This condition is likely to hold in clustered
education RCTs where random assignment is conducted at the school level (the most
common design), unless there is substantial interaction between students and staff across
study schools.

U2. Unit-Level Monotonicity: $R_i(1) \geq R_i(0)$. This means that units are at least as likely to offer
intervention services in the treatment condition as in the control condition, and implies that there are no
defiers (that is, $p_D = 0$). Under this assumption, $p_{CL} = P(R_i(1) = 1) - P(R_i(0) = 1)$, which
is the difference between service offer rates in the treatment and control conditions.

U3. The Unit-Level Exclusion Restriction: $W_i(1, r) = W_i(0, r)$ for $r = 0, 1$. This means that the
outcome for a unit that offers services would be the same in the treatment or control
condition, and similarly for a unit that does not offer services. Stated differently, this
restriction implies that any effect of $T_i$ on outcomes must occur only through an effect of $T_i$
on service offer rates. This restriction implies that impacts on always-takers and never-takers
are zero, that is, $\alpha_{ITT\_A} = \alpha_{ITT\_N} = 0$.

Under these assumptions, the final three terms on the right-hand-side of (15) drop out. Thus, the following
CACE impact parameter can be identified:

(16) $\alpha_{CACE0} = \alpha_{ITT\_CL} = \alpha_{ITT} / p_{CL}$.

This parameter represents the average causal effect of the treatment for compliers.
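The identification argument can be illustrated numerically. The sketch below builds a pooled ITT from hypothetical compliance-group shares and impacts that satisfy assumptions U2 and U3, and then recovers the complier impact by dividing by the difference in service offer rates. All names and numbers are illustrative assumptions.

```python
def compliance_type(r1, r0):
    """Classify a unit from its potential offer decisions R_i(1) and R_i(0)."""
    return {
        (1, 0): "complier",
        (0, 0): "never-taker",
        (1, 1): "always-taker",
        (0, 1): "defier",
    }[(r1, r0)]


def cace_from_itt(itt, offer_rate_treat, offer_rate_ctrl):
    """Equation (16): ITT divided by the complier share under monotonicity."""
    p_cl = offer_rate_treat - offer_rate_ctrl
    if p_cl <= 0:
        raise ValueError("complier share must be positive")
    return itt / p_cl


# Hypothetical population shares; p_D = 0 under monotonicity (U2).
shares = {"complier": 0.6, "never-taker": 0.3, "always-taker": 0.1, "defier": 0.0}
# Hypothetical impacts; zero for always- and never-takers under U3.
impacts = {"complier": 0.5, "never-taker": 0.0, "always-taker": 0.0, "defier": 0.0}

itt = sum(shares[g] * impacts[g] for g in shares)          # equation (15)
offer_rate_treat = shares["complier"] + shares["always-taker"]
offer_rate_ctrl = shares["always-taker"] + shares["defier"]
cace = cace_from_itt(itt, offer_rate_treat, offer_rate_ctrl)
```

The recovered value matches the complier impact that was built into the example, even though no individual unit's compliance type is used in the final division; only the two observable offer rates are needed.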
Importantly, follow-up data on all sample members are required to estimate the CACE parameter even
though this parameter pertains to the complier subgroup only. Thus, noncompliance is different from data
nonresponse.



The CACE Parameter When Compliance Decisions Are Made by Units and Students

In this section, we generalize the CACE parameter from above to the case where compliance decisions are
made by both school staff and students. For this analysis, we require assumptions on both students and
schools to identify the CACE parameter.

Table 4.1 displays and labels the 16 possible student-level compliance groups that depend on treatment
status ($T_i$), whether the school offers services ($R_i$), and whether the student receives services ($D_{ij}$). In
this scenario, there are four groups each of compliers, never-takers, defiers, and always-takers. For
example, Never-Taker Group 2 includes students who would never receive services even though their
schools would always offer them. Note that students with $R_i = 0$ and $D_{ij} = 1$ are assumed to receive
services from a different study school than their baseline school. The frequency of each of the 16
combinations will depend on the particular application, and some may be rare. However, all combinations
are included for completeness.

To derive the CACE parameter under this scenario, we define
$\alpha^i_{ITT} = E(Y_{Tij} - Y_{Cij} \mid W_{Ti}, W_{Ci}, p_{1i}, \ldots, p_{16i})$
as the within-unit ITT for the student population in unit $i$, where $p_{gi}$ is the fraction of students in
compliance group $g$ ($\sum_{g=1}^{16} p_{gi} = 1$). Note that $E(\alpha^i_{ITT}) = \alpha_{ITT}$, where the expectation is taken over the joint
unit-level potential outcome and compliance distributions. Note next that $\alpha^i_{ITT}$ can be expressed as a
weighted average of the within-unit ITT parameters for each of the 16 student-level compliance groups
shown in Table 4.1:

(17) $\alpha^i_{ITT} = \sum_{g=1}^{16} p_{gi} \alpha^i_{ITT\_g}$,

where $\alpha^i_{ITT\_g}$ is the impact parameter for compliance group $g$.
Table 4.1: Possible Student-Level Compliance Groups

                         Compliance Status in the         Compliance Status in the
                         Treatment Condition              Control Condition
Compliance Group:        Unit:      Student:              Unit:      Student:
Number and Label         R_i(1)     D_ij[1, R_i(1)]       R_i(0)     D_ij[0, R_i(0)]

 1. Complier 1           1          1                     0          0
 2. Always-Taker 1       1          1                     0          1
 3. Complier 2           1          1                     1          0
 4. Always-Taker 2       1          1                     1          1
 5. Never-Taker 1        1          0                     0          0
 6. Defier 1             1          0                     0          1
 7. Never-Taker 2        1          0                     1          0
 8. Defier 2             1          0                     1          1
 9. Complier 3           0          1                     0          0
10. Always-Taker 3       0          1                     0          1
11. Complier 4           0          1                     1          0
12. Always-Taker 4       0          1                     1          1
13. Never-Taker 3        0          0                     0          0
14. Defier 3             0          0                     0          1
15. Never-Taker 4        0          0                     1          0
16. Defier 4             0          0                     1          1

Note: R_i(1) = 1 and D_ij(1,1) = 1 if the student’s unit (for example, school) would offer intervention
services in the treatment condition and the student would then agree to receive services, and
similarly for other combinations.
Using (17) and Table 4.1, the CACE parameter for Complier Group 1 can be identified under the
following assumptions (that are analogs to the unit-level assumptions from above):

S1. SUTVA: Potential student-level service receipt decisions [$D_{ij}(T_i, R_i)$] and outcomes
[$Y_{ij}(T_i, R_i, D_{ij})$] are unrelated to the treatment status of other students and schools. In
addition, we impose the unit-level SUTVA condition U1 from above that $R_i(T_i)$ is
unrelated to the treatment status of other units.

S2. Monotonicity on Compliance: $R_i(1) \geq R_i(0)$ or $D_{ij}(1, R_i(1)) \geq D_{ij}(0, R_i(0))$. This
assumption will be satisfied if a unit is at least as likely to offer services in the treatment
condition as in the control condition, or if students in that unit are at least as likely to receive services in
the treatment condition as in the control condition. Using Table 4.1, this condition implies that $p_{16} = 0$.

S3. Student-Level Monotonicity on the Take-Up of Services: $D_{ij}(s, 1) \geq D_{ij}(t, 0)$ for
$s, t \in \{0, 1\}$. This assumption means that students are at least as likely to take up services if
they are offered them than if they are not, which implies that $p_6 = p_{11} = 0$.

S4. The Student-Level Exclusion Restriction on Compliance: $D_{ij}(1, r) = D_{ij}(0, r)$ for $r = 0, 1$.
This means that for a given service offer status, the student’s compliance decision would be
the same in the treatment or control condition. Stated differently, this restriction implies that
any effect of $T_i$ on student compliance decisions must be a result of a treatment effect on
service offer rates. Using Table 4.1, this restriction implies that $p_3 = p_8 = p_9 = p_{14} = 0$.

S5. The Student-Level Exclusion Restriction on Outcomes: $Y_{ij}(1, R_i(1), d) = Y_{ij}(0, R_i(0), d)$
for $d = 0, 1$. This means that student outcomes are determined solely by whether or not the
student receives services, and it does not matter where these services are received (or not
received) or how many other students are receiving them. This restriction implies zero
impacts for Groups 2, 4, 5, 7, 10, 12, 13, and 15.

These assumptions imply that the only term on the right-hand-side of (17) that does not drop out is the first
term that pertains to Complier Group 1 (see Table 4.1). Thus, after taking expectations in (17), the
following CACE parameter can be identified:

(18) $\alpha_{CACE} = \alpha_{ITT} / E(p_{1i}) = E(p_{1i} \alpha^i_{ITT\_1}) / E(p_{1i})$.

This CACE parameter is a weighted average of within-unit impacts for Complier Group 1, with weights
$p_{1i}$ (and reduces to $E(\alpha^i_{ITT\_1})$ if $\alpha^i_{ITT\_1}$ and $p_{1i}$ are independent). Denoting $p_g = E(p_{gi})$, assumptions
S2 to S4 imply that $p_1 = (p_T - p_C)$, where $p_T = p_1 + p_2 + p_4 + p_{10} + p_{12}$ and $p_C = p_2 + p_4 + p_{10} + p_{12}$
are the fractions of students receiving services in the treatment and control conditions, respectively. Thus,
the only difference between $\alpha_{CACE0}$ in (16) and $\alpha_{CACE}$ in (18) is that $p_{CL}$ refers to service offer rates for
units whereas $p_1$ refers to service receipt rates for students. Clearly, (18) is more general and reduces to
(16) if compliance decisions are made by school staff only. Thus, in what follows, we focus on estimation
issues for the $\alpha_{CACE}$ parameter.[2]
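The bookkeeping behind this result can be checked by enumerating the 16 groups of Table 4.1 and applying assumptions S2 to S5 mechanically. The sketch below is one possible encoding of those assumptions in terms of each group's four indicators; the function names and the encoding itself are assumptions made for illustration.

```python
from itertools import product


def enumerate_groups():
    """Return {group number: (R1, D1, R0, D0)} in the order of Table 4.1."""
    patterns = product((1, 0), (1, 0), (0, 1), (0, 1))
    return {number: pattern for number, pattern in enumerate(patterns, start=1)}


def ruled_out(r1, d1, r0, d0):
    """True if S2, S3, or S4 forces the group's population share to zero."""
    # S2 fails only when both the offer and the receipt indicator fall.
    s2_fails = r1 < r0 and d1 < d0
    # S3 fails when receipt without an offer exceeds receipt with an offer.
    s3_fails = ((r1 == 1 and r0 == 0 and d1 < d0)
                or (r1 == 0 and r0 == 1 and d0 < d1))
    # S4 fails when receipt differs although the offer status is the same.
    s4_fails = r1 == r0 and d1 != d0
    return s2_fails or s3_fails or s4_fails


groups = enumerate_groups()
excluded = {g for g, v in groups.items() if ruled_out(*v)}
remaining = {g: v for g, v in groups.items() if g not in excluded}

# S5: groups with the same receipt status in both conditions have zero impact.
contributing = [g for g, (r1, d1, r0, d0) in remaining.items() if d1 != d0]
treat_recipients = [g for g, (r1, d1, r0, d0) in remaining.items() if d1 == 1]
ctrl_recipients = [g for g, (r1, d1, r0, d0) in remaining.items() if d0 == 1]
```

Under this encoding, only Complier Group 1 both survives S2 to S4 and has a nonzero impact under S5, and the surviving recipient groups in each condition are the ones that appear in the expressions for the treatment and control service receipt rates.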



Impact and Variance Estimation of the CACE Parameter

In this section, we discuss estimation of the CACE parameter. We use an IV approach, because simple
closed-form variance formulas exist, the variance correction terms can easily be understood because they
enter the formulas linearly, and the formulas can be readily generalized to the standardized CACE
parameter. An alternative approach, without these properties, is to use (more efficient) maximum
likelihood estimation methods and the EM algorithm (Jo et al. 2008).



Impact Estimation 

A consistent estimator for $\alpha_{CACE}$ in (18) can be obtained by dividing consistent estimators for
$\alpha_{ITT}$ and $p_1$:

(19) $\hat{\alpha}_{CACE} = \hat{\alpha}_{ITT} / \hat{p}_1$.

Estimators for $p_1 = p_T - p_C$ can be obtained by noting that this parameter represents an impact on the
rate of service receipt. Thus, estimation methods similar to those discussed above for $\alpha_{ITT}$ can be
used to estimate $p_1$. For example, analogous to (5), the simple differences-in-means estimator is
$\hat{p}_1 = (\bar{d}_T - \bar{d}_C)$, where $d_{ij}$ is an observed service receipt status indicator variable that
equals 1 if student i in school j received intervention services, and zero otherwise.
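
As an illustration, the point estimators in (19) can be sketched in Python. This is a minimal sketch rather than the procedure used in the report: it assumes equal cluster sizes (so that student-level means equal the means of unit means), and the function name `cace_point_estimate` and its array arguments are hypothetical.

```python
import numpy as np

def cace_point_estimate(y, d, t):
    """Differences-in-means ITT, service-receipt, and CACE estimates (eq. 19).

    y: outcomes; d: 0/1 service receipt indicators; t: 0/1 treatment status.
    All three are student-level arrays of equal length (hypothetical layout).
    """
    y, d, t = (np.asarray(a, dtype=float) for a in (y, d, t))
    treat, ctrl = t == 1, t == 0
    alpha_itt = y[treat].mean() - y[ctrl].mean()   # ITT impact on the outcome
    p1 = d[treat].mean() - d[ctrl].mean()          # impact on the service receipt rate
    if p1 == 0:
        raise ValueError("service receipt rates do not differ; CACE is undefined")
    return alpha_itt, p1, alpha_itt / p1
```

With an ITT impact of 4 points and a 75 percentage point difference in service receipt rates, for example, the function returns a CACE estimate of 4 / 0.75.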



Variance Estimation 

The CACE estimator in (19) is a ratio estimator (Little et al. 2008 and Heckman et al. 1994). Both the
numerator and denominator are measured with error, and thus both sources of error should be taken into
account in the variance calculations. A variance estimator for $\hat{\alpha}_{CACE}$ can be obtained using an
asymptotic Taylor series expansion of $\hat{\alpha}_{CACE}$ around the true value $\alpha_{CACE}$:

(20) $(\hat{\alpha}_{CACE} - \alpha_{CACE}) \approx \frac{(\hat{\alpha}_{ITT} - \alpha_{ITT})}{p_1} - \frac{\alpha_{ITT}(\hat{p}_1 - p_1)}{p_1^2}$.

^1 A special case of our general framework is when always-takers (groups 2, 4, 10, and 12) are not present,
possibly as a result of strict implementation rules ensuring that students from control schools cannot receive
intervention services. In this case, recipients of intervention services belong only to complier group 1 in the
treatment group, and the CACE parameter is equivalent to the average treatment effect on those who receive
services (the "treatment-on-the-treated" parameter).

Taking expectations of the squares of both sides of (20) and inserting estimators for unknown parameters
yields the following variance estimator for $\hat{\alpha}_{CACE}$:

(21) $AsyVar(\hat{\alpha}_{CACE}) = \frac{AsyVar(\hat{\alpha}_{ITT})}{\hat{p}_1^2} + \frac{\hat{\alpha}_{CACE}^2\,AsyVar(\hat{p}_1)}{\hat{p}_1^2} - \frac{2\hat{\alpha}_{CACE}\,AsyCov(\hat{\alpha}_{ITT}, \hat{p}_1)}{\hat{p}_1^2}$.

The first term in (21) is the variance of the CACE estimator assuming that estimated service receipt rates
are measured without error. The second and third terms are therefore correction terms. The second term
accounts for the estimation error in $\hat{p}_1$, and the third term accounts for the covariance between
$\hat{\alpha}_{ITT}$ and $\hat{p}_1$.^2 Importantly, these correction terms depend on the size of $\hat{\alpha}_{CACE}$
and thus become more important with larger impacts. Finally, because $\hat{\alpha}_{ITT}$ and $\hat{p}_1$ are
asymptotically normal, the delta method (Greene 2000, p. 118) implies that $\hat{\alpha}_{CACE}$ is asymptotically
normal.
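
The three terms of the delta-method variance in (21) can be sketched as follows. This is an illustrative sketch under the assumption that (21) has the three-term form implied by the expansion in (20); the inputs `var_itt`, `var_p1`, and `cov_itt_p1` are assumed to have been estimated already, and all names are hypothetical.

```python
def cace_variance(alpha_itt, p1, var_itt, var_p1, cov_itt_p1):
    """Return (uncorrected, corrected) variance estimates for the CACE estimator.

    uncorrected: first term of (21), treating the service receipt rate as known.
    corrected:   all three terms of (21).
    """
    alpha_cace = alpha_itt / p1
    uncorrected = var_itt / p1**2
    # Correction for estimation error in the service receipt rate.
    corr_p1 = alpha_cace**2 * var_p1 / p1**2
    # Correction for the covariance between the ITT and service receipt estimators.
    corr_cov = -2.0 * alpha_cace * cov_itt_p1 / p1**2
    return uncorrected, uncorrected + corr_p1 + corr_cov
```

Because both correction terms are scaled by the CACE estimate, they vanish when the estimated impact is zero, which is consistent with the observation that the corrections matter more for larger impacts.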



An asymptotic variance estimator for $\hat{p}_1$ that adjusts for clustering can be obtained from (10) and (11)
using simple differences-in-means methods or from (13) and (14) using linear probability models, where
$y_{ij}$ is replaced by $d_{ij}$. For our empirical work, we used a slightly different variance estimator that
allows for different processes underlying service receipt decisions for treatments and controls:

(22) $AsyVar(\hat{p}_1) = \frac{S_{dT}^2}{np} + \frac{S_{dC}^2}{n(1-p)}$,

where $S_{dT}^2 = \sum_{j:T_j=1} (\bar{d}_j - \bar{d}_T)^2 / [np - k]$, $S_{dC}^2 = \sum_{j:T_j=0} (\bar{d}_j - \bar{d}_C)^2 / [n(1-p) - k]$, and k is the number of
unit-level covariates (including the intercept) that are included in the model.

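Similarly, an unbiased estimator for $AsyCov(\hat{\alpha}_{ITT}, \hat{p}_1)$ is as follows:

(23) $AsyCov(\hat{\alpha}_{ITT}, \hat{p}_1) = \frac{S_{ydT}}{np} + \frac{S_{ydC}}{n(1-p)}$,

where $S_{ydT} = \sum_{j:T_j=1} (\bar{y}_j - \bar{y}_T)(\bar{d}_j - \bar{d}_T) / [np - k]$ and $S_{ydC} = \sum_{j:T_j=0} (\bar{y}_j - \bar{y}_C)(\bar{d}_j - \bar{d}_C) / [n(1-p) - k]$.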
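
The estimators in (22) and (23) can be sketched from unit-level means as shown below. This sketch assumes that the two formulas sum, separately by research group, the squared deviations (or cross-products) of unit means from their group mean, divided by the number of units in the group minus k; this reading is reconstructed from context, and the function and argument names are hypothetical.

```python
import numpy as np

def unit_level_var_cov(ybar, dbar, t, k=1):
    """Estimate AsyVar(p1_hat) as in (22) and AsyCov(itt_hat, p1_hat) as in (23).

    ybar, dbar: unit-level means of the outcome and of service receipt.
    t: 0/1 unit-level treatment status; k: number of unit-level covariates
    (including the intercept).
    """
    ybar, dbar, t = (np.asarray(a, dtype=float) for a in (ybar, dbar, t))
    var_p1, cov_itt_p1 = 0.0, 0.0
    for arm in (1, 0):
        mask = t == arm
        n_arm = mask.sum()                      # np for treatments, n(1-p) for controls
        dy = ybar[mask] - ybar[mask].mean()
        dd = dbar[mask] - dbar[mask].mean()
        var_p1 += (dd**2).sum() / (n_arm - k) / n_arm
        cov_itt_p1 += (dy * dd).sum() / (n_arm - k) / n_arm
    return var_p1, cov_itt_p1
```

Estimating the two research groups separately is what allows the service receipt process to differ between treatments and controls.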



^2 Little et al. (2008) and Heckman et al. (1994) ignore the covariance term.

Finally, the CACE impact and variance estimators discussed above are IV estimators (Angrist et al. 1996).
To see this, consider the following variant of the model in (3):

(24) $y_{ij} = \alpha_0 + \alpha_{CACE} d_{ij} + (u_j + e_{ij})$,

where $u_j$ and $e_{ij}$ are random error terms. If $T_j$ is used as an instrument for $d_{ij}$ in (24), then the
estimated IV regression coefficient $\hat{\alpha}_{CACE}$ is the simple differences-in-means CACE estimator in (19)
with the variance estimator in (21). Treatment status $T_j$ is likely to be a "strong" instrument if service
receipt rates differ markedly for treatment and control students (see Murray 2006 and Stock et al. 2002 for
a discussion of weak and strong instruments).^3
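
The equivalence between the IV coefficient and the ratio of differences in means can be checked numerically. The sketch below covers the point estimate only (not the variance), uses simulated data, and relies on the standard just-identified IV formula $(Z'X)^{-1}Z'y$ with a constant and a single binary instrument; the function names are hypothetical.

```python
import numpy as np

def wald_estimate(y, d, t):
    """Ratio of the difference in mean outcomes to the difference in mean service receipt."""
    treat, ctrl = t == 1, t == 0
    return (y[treat].mean() - y[ctrl].mean()) / (d[treat].mean() - d[ctrl].mean())

def iv_estimate(y, d, t):
    """Just-identified IV coefficient on d, using treatment status t as the instrument."""
    Z = np.column_stack([np.ones_like(t), t])   # instruments: constant and T
    X = np.column_stack([np.ones_like(d), d])   # regressors: constant and d
    beta = np.linalg.solve(Z.T @ X, Z.T @ y)
    return beta[1]

rng = np.random.default_rng(0)
t = np.repeat([1.0, 0.0], 100)
d = t * (rng.random(200) < 0.8)                 # only treatment group members can receive services
y = 1.0 + 2.0 * d + rng.normal(size=200)
print(wald_estimate(y, d, t), iv_estimate(y, d, t))
```

The two printed values agree up to floating-point error, because with one binary instrument the IV normal equations reduce to the ratio of the two differences in means.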



^3 Due to the correction terms in (21), the correct p-values of the ITT and CACE estimates will generally differ,
and the choice of the parameter on which inference is conducted should be determined by the population of interest.
However, when the problem of weak instruments precludes valid inference on the CACE parameter, inference about
the absence of an effect may have to be conducted solely on the ITT parameter.

Chapter 5: The Standardized ITT and CACE Estimators



It is becoming increasingly popular in educational research to standardize estimated impacts into standard 
deviation units (Hedges 1981 and 2007). This approach can be used to facilitate the comparison of impact 
findings across outcomes that are measured on different scales. It has also been used extensively in meta- 
analyses to contrast and collate impact findings across a broad range of disciplines (Cohen 1988; Lipsey 
and Wilson 1993). The use of effect sizes is especially important for helping to understand impact 
findings on outcomes that are difficult to interpret when measured in nominal units (for example, impacts 
on behavioral scales or test scores). In addition, this approach is useful for creating composite measures 
across multiple outcomes, and for scaling an outcome that is measured differently across students (such as 
state achievement test scores from different states). Finally, it has become standard practice in education 
evaluations to conduct power analyses using primary outcomes that are measured in effect size units, to 
ensure adequate study sample sizes for detecting impacts that are meaningful and attainable based on 
findings from previous studies. 



Impact Estimation for the Standardized ITT Estimator 

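The ITT parameter in effect size units, $\alpha_{ITT\_E}$, can be expressed as follows:

(25) $\alpha_{ITT\_E} = \alpha_{ITT} / \sigma_y$,

where $\sigma_y$ is the standard deviation of the outcome across all treatment and control students.^4

An unbiased standard deviation estimator for $\sigma_y$ (the square root of the sum of the between-unit and
within-unit variance components) can be obtained as follows:

(26) $S_y = \sqrt{S_B^2 + \frac{(m-1)}{m} S_W^2}$,

where $S_B^2$ and $S_W^2$ are defined as in (9) and (7) above. Thus, a consistent estimator for $\alpha_{ITT\_E}$ is:

(27) $\hat{\alpha}_{ITT\_E} = \hat{\alpha}_{ITT} / S_y$.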



Variance Estimation for the Standardized ITT Estimator 

The effect size estimator in (27) is a ratio estimator where both the numerator and denominator are
measured with error. We discern, however, two competing views on whether it is necessary, when
reporting impact results, to adjust the variance of this estimator for the estimation error in $S_y$. One view,
which opposes variance corrections, is that standardized impact estimators are descriptive statistics for
interpreting and benchmarking the impacts in nominal units. In this view, standardized outcomes are not
measures per se, and thus the nominal estimator is the relevant impact for assessing whether an education
intervention had a significant impact on the outcome. The alternative view is that the standardized impact
estimator is often the impact measure on which researchers and policymakers focus. Thus, standardized
outcomes are effectively the outcome measures of interest, and standardized impacts should have proper
standard errors attached to them.

^4 In clustered designs, $\sigma_y$ could also be defined as the within- or between-unit standard deviation, and
could also be measured using the control group only or an outside sample (for example, a sample with pertinent
data that is larger and more representative of the study "universe" than the sample for the current study)
(Hedges 2007). All formulas below can be adapted using these alternative definitions for $\sigma_y$.

Given these opposing views, we believe that it is appropriate that impact studies report correct standard 
errors for ITT impact estimates in both nominal and effect size units. Thus, in what follows, we discuss 
simple asymptotic variance formulas for the standardized estimators (see Hedges 2007 for similar results 
using finite populations and unequal cluster sizes). 

A variance estimator for $\hat{\alpha}_{ITT\_E}$ can be obtained from the delta method using a Taylor series
expansion of $\hat{\alpha}_{ITT\_E}$ around the true value $\alpha_{ITT\_E}$, which, after inserting estimators for
unknown parameters, yields the following expression:

(28) $AsyVar(\hat{\alpha}_{ITT\_E}) = \frac{AsyVar(\hat{\alpha}_{ITT})}{S_y^2} + \frac{\hat{\alpha}_{ITT\_E}^2\,AsyVar(S_y)}{S_y^2}$,

where the asymptotic covariance term between $\hat{\alpha}_{ITT}$ and $S_y$ can be shown to be zero using results
on the independence of linear functions and quadratic forms for normal distributions.

The first term in (28) is the variance expression for the effect size impact ignoring the estimation error in
$S_y$ (the usual approach found in the literature). The second term, therefore, is a correction term. This
term increases as the magnitude of $\hat{\alpha}_{ITT\_E}$ increases, and is zero if and only if $\hat{\alpha}_{ITT\_E} = 0$.



Finally, (28) requires an estimator for $AsyVar(S_y)$, which can be obtained as follows (see Appendix A
for a proof):

(29) $AsyVar(S_y) = \frac{S_B^4}{2(n-2)S_y^2} + \frac{(m-1)S_W^4}{2nm^2 S_y^2}$.

This expression also applies to nonclustered designs where units are defined as students. In this case,
$S_W^2 = 0$ and $S_B^2 = S_y^2$, so that (29) reduces to $S_y^2 / [2(n-2)]$.
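
The steps from (26) through (29) can be sketched as follows. This sketch assumes that $S_B^2$ and $S_W^2$ denote the between-unit and within-unit variance estimators from (9) and (7), which are not reproduced in this section, and that (26) and (29) have the forms reconstructed from context; the function names are hypothetical.

```python
import math

def pooled_variance(s2_b, s2_w, m):
    """Squared standard deviation estimator from (26); m is the number of students per unit."""
    return s2_b + (m - 1) / m * s2_w

def asyvar_sd(s2_b, s2_w, n, m):
    """Asymptotic variance of S_y as in (29); n is the number of units."""
    s2_y = pooled_variance(s2_b, s2_w, m)
    return s2_b**2 / (2 * (n - 2) * s2_y) + (m - 1) * s2_w**2 / (2 * n * m**2 * s2_y)

def standardized_itt(alpha_itt, var_itt, s2_b, s2_w, n, m):
    """Return the effect size (27) and its uncorrected and corrected variances (28)."""
    s2_y = pooled_variance(s2_b, s2_w, m)
    effect_size = alpha_itt / math.sqrt(s2_y)
    uncorrected = var_itt / s2_y                      # first term of (28)
    corrected = uncorrected + effect_size**2 * asyvar_sd(s2_b, s2_w, n, m) / s2_y
    return effect_size, uncorrected, corrected
```

In the nonclustered case (m = 1 and a within-unit variance of zero), `asyvar_sd` returns the squared standard deviation divided by 2(n - 2), in line with the special case noted above.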



Impact and Variance Estimation for the Standardized CACE Estimator

Using results from above, a CACE estimator in effect size units can be expressed as follows:

(30) $\hat{\alpha}_{CACE\_E} = \hat{\alpha}_{ITT} / (S_y \hat{p}_1)$,

where it is assumed that the standard deviation for compliers is the same as it is for the full sample. Using
the delta method, a variance estimator for $\hat{\alpha}_{CACE\_E}$ is:

(31) $AsyVar(\hat{\alpha}_{CACE\_E}) = \frac{AsyVar(\hat{\alpha}_{ITT})}{S_y^2 \hat{p}_1^2} + \frac{\hat{\alpha}_{CACE\_E}^2\,AsyVar(S_y)}{S_y^2} + \frac{\hat{\alpha}_{CACE\_E}^2\,AsyVar(\hat{p}_1)}{\hat{p}_1^2} - \frac{2\hat{\alpha}_{CACE\_E}\,AsyCov(\hat{\alpha}_{ITT}, \hat{p}_1)}{S_y \hat{p}_1^2}$,

where we have ignored the covariance term between $S_y$ and $\hat{p}_1$. The estimator $\hat{\alpha}_{CACE\_E}$ is
asymptotically normal because each estimator component is asymptotically normal.

Chapter 6: Empirical Analysis



RCTs in the education field often report the same significance levels for each of the ITT and CACE
estimators considered above. This chapter uses data from ten RCTs to assess this approach.



Data 

Data for our analysis come from ten large published RCTs conducted by Mathematica Policy Research, 
Inc. (MPR). We selected these RCTs due to their significance for policy and their coverage of a wide 
range of interventions found in the education and social policy fields. Most of these evaluations were 
advised by national panels of evaluation and subject-area experts. Appendix Table B.1 lists the RCTs and
summarizes the basic features of each one, including the 20 key outcome variables selected for our
analysis, the covariates used in regression adjustment, and the unit of random assignment (that is, the 
level of clustering). 

The RCTs include six evaluations of K-12 educational interventions. The remaining four RCTs include 
evaluations of interventions in welfare, labor, and early childhood education, which are included to help 
gauge the robustness of our findings beyond the K-12 setting. Overall, the ten studies span a wide range 
of outcomes, geographic areas, and target populations, and there is a mix of clustered and nonclustered 
designs. All ten studies were used for the standardized ITT analysis. 

The CACE analysis was conducted using data from seven RCTs where noncompliers were identified
using service receipt data. Appendix Table B.2 provides definitions of program "participation" used in
our CACE analysis, and shows unadjusted service receipt rates in the treatment and control groups. For
the 21st Century, New York City Voucher, Power4Kids, Early Head Start, and Job Corps evaluations, we
defined program participation using the same rules as used by the studies. The Teacher Induction and
Education Technologies evaluations did not conduct CACE analyses, so we developed illustrative rules
for defining noncompliers using available service receipt data. The CACE analysis was not conducted for
the remaining three RCTs (the Teach for America, San Diego Food Stamp Cash-Out, and Teenage Parent
Demonstration evaluations) due to full compliance of study subjects.

For the CACE analysis, individuals were coded as service recipients if they received at least a minimal
amount of services. It is appropriate to set the bar low for defining service receipt to ensure that impacts
on never-takers are likely to be zero (see assumptions U3 and S5 above).



Methods 

Data from each RCT were used to obtain (1) uncorrected variance estimators where the denominator 
terms of the impact estimators were assumed to be known, and (2) corrected variance estimators that 
accounted for all sources of estimation error. 

Variance estimators for $\hat{\alpha}_{ITT\_E}$ were obtained using (28). To apply (28), we estimated between-unit
ANCOVA models to obtain $\hat{\alpha}_{ITT}$ and then used (14) to obtain $AsyVar(\hat{\alpha}_{ITT})$. The ANCOVA models
included covariates as similar as possible to those used in the published studies (see Table B.1).^5 The
estimation of $\sigma_y$ and $AsyVar(S_y)$ involved a straightforward application of (26) and (29). Equations
(31), (22), and (23) were used to obtain $AsyVar(\hat{\alpha}_{CACE\_E})$. Similar impact and variance results were
found using simple differences-in-means procedures and the other estimation methods discussed above
(not shown).

^5 The impact estimates and uncorrected variance estimates that we report are slightly different than those
reported in the published study reports due to the standardization of the estimation methods that we used across
(continued below)

The CACE analysis required the estimation of the fraction of individuals who were compliers. To do
this, we defined a binary variable $d_{ij}$ that was set to 1 if the individual received services and zero
otherwise. We then estimated $p_1$ as the coefficient on $T_j$ from a between-unit regression of the service
receipt indicator on $T_j$ and the same covariates that were used to estimate the ITT parameters. Similar
results were found using simple differences-in-means procedures and logit models.

For each outcome, we quantified the importance of the variance corrections in two ways. First, we
calculated the difference between the corrected standard error (the square root of the sum of the two terms
in (28) or four terms in (31)) and the uncorrected standard error (the square root of the first term in (28)
or (31)) as a percentage of the uncorrected standard error. Second, we used t-statistics to assess the effect
of the variance corrections on the statistical significance of $\hat{\alpha}_{ITT\_E}$ and $\hat{\alpha}_{CACE\_E}$ by calculating
the absolute difference between the corrected and uncorrected p-values.^6
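
The two summary measures described above can be sketched as follows. This sketch uses a standard normal approximation for the two-sided p-value, which is an assumption made here for simplicity (the analysis in the text uses t-statistics); the function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value for the null of a zero impact, using a normal approximation."""
    z = abs(estimate) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

def correction_impact(estimate, se_uncorrected, se_corrected):
    """Return the percentage change in the standard error and the absolute p-value change."""
    pct_change = 100.0 * (se_corrected - se_uncorrected) / se_uncorrected
    p_change = abs(two_sided_p(estimate, se_corrected) - two_sided_p(estimate, se_uncorrected))
    return pct_change, p_change

# Word attack score, Power4Kids (values taken from Table 6.5, rescaled from x10^2 units).
pct, dp = correction_impact(0.379, 0.08054, 0.08186)
```

For this row, the percentage change in the standard error is about 1.64, matching the fifth column of Table 6.5.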

The importance of the variance corrections will depend on the size of the impact estimates. Thus, to 
assess the sensitivity of our main findings to larger impacts than were typically found in the considered 
RCTs, we conducted simulations assuming that impacts were 0.25 standard deviations, which is a value 
that education RCTs are often powered to detect (Schochet 2007). For these simulations, variances of 
nominal ITT impact estimators were assumed to be the same as those observed in the data. Finally, for 
each outcome, we conducted a related analysis by identifying the smallest positive impact values for 
which the variance corrections would raise the standard errors of the impact estimators by 5 percent from 
the uncorrected values. Such an increment to the standard errors would cause an impact estimate with an 
uncorrected p-value of 0.04 to become, as a result of the correction, barely insignificant at the 0.05 level. 
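
Both calculations in this paragraph can be illustrated with a short sketch. It covers the standardized ITT case only, where (28) implies that the corrected variance equals the uncorrected variance plus the squared effect size times $AsyVar(S_y)/S_y^2$, and it holds the uncorrected variance fixed as in the simulations. The normal approximation and the function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def threshold_effect_size(var_uncorrected, rel_var_sd, increase=0.05):
    """Smallest positive effect size at which the correction raises the standard error
    by the given proportion.

    var_uncorrected: first term of (28); rel_var_sd: AsyVar(S_y) / S_y**2.
    Solves sqrt(var_uncorrected + a**2 * rel_var_sd) = (1 + increase) * sqrt(var_uncorrected).
    """
    return math.sqrt(((1.0 + increase) ** 2 - 1.0) * var_uncorrected / rel_var_sd)

def p_after_inflation(p_uncorrected, increase=0.05):
    """Two-sided p-value after the standard error is inflated by the given proportion."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - p_uncorrected / 2.0)
    return 2.0 * (1.0 - nd.cdf(z / (1.0 + increase)))

print(p_after_inflation(0.04))
```

Under the normal approximation, the printed p-value is slightly above 0.05, which is the sense in which a 5 percent increase in the standard error makes an estimate with an uncorrected p-value of 0.04 barely insignificant.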

Because our variance formulas are based on asymptotic normality of the impact estimators and assume 
equal cluster sizes, they are only approximations. Thus, to evaluate whether our formulas apply well to 
sample and cluster sizes that arise in practice, we compared p-values based on our variance formulas with 
those based on a nonparametric bootstrap. The two methods yield very similar p-values (Appendix Table 
B.3). 
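
A nonparametric bootstrap for a clustered design can be sketched as follows. The text does not describe the resampling scheme, so this sketch assumes that whole units are resampled with replacement within each research group and that the simple differences-in-means CACE estimator is recomputed on each replicate; the function name and arguments are hypothetical.

```python
import numpy as np

def cluster_bootstrap_se(ybar, dbar, t, n_boot=2000, seed=0):
    """Bootstrap standard error of the CACE estimator, resampling units within each group.

    ybar, dbar: unit-level means of the outcome and of service receipt.
    t: 0/1 unit-level treatment status.
    """
    rng = np.random.default_rng(seed)
    ybar, dbar, t = (np.asarray(a, dtype=float) for a in (ybar, dbar, t))
    idx_t = np.flatnonzero(t == 1)
    idx_c = np.flatnonzero(t == 0)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        bt = rng.choice(idx_t, size=idx_t.size, replace=True)   # resampled treatment units
        bc = rng.choice(idx_c, size=idx_c.size, replace=True)   # resampled control units
        itt = ybar[bt].mean() - ybar[bc].mean()
        p1 = dbar[bt].mean() - dbar[bc].mean()                  # assumed nonzero in every replicate
        draws[b] = itt / p1
    return draws.std(ddof=1)
```

The resulting standard error can then be compared with the square root of the formula-based variance estimate.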



^5 (continued) studies, and small differences in covariate sets, the treatment of strata, and weighting
schemes. However, the two sets of findings are very similar (see Appendix Table B.1).

^6 We focus on absolute, rather than percentage, changes in the p-values because a large percentage change in a
p-value may have only a trivial effect on statistical significance if the original, uncorrected p-value is already small.

Results



The nominal ITT estimates are statistically significant at the five percent level for half of the 20 outcomes
included in the analysis (Table 6.1; Column 4). These significance levels also apply to the standardized
ITT and CACE estimates (discussed later) using the uncorrected variance estimates. Estimates of
$\hat{\alpha}_{ITT\_E}$ are less than 0.15 standard deviations in absolute value for 16 of the 20 outcomes (Table 6.1;
Column 5). The Power4Kids study had the largest intervention effects (0.38 and 0.22 standard deviations
for the two reading outcomes, respectively).

Compliance rates varied somewhat across the 7 studies included in the CACE analysis (Table 6.2;
Column 2). The compliance rate was at least 88 percent in four RCTs, and ranged from 72 to 77 percent
in the three other RCTs. By construction, $\hat{\alpha}_{CACE\_E}$ becomes closer to $\hat{\alpha}_{ITT\_E}$ as estimated
compliance rates increase (Table 6.2).
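
The relationship between the two standardized estimates can be checked directly against Table 6.2, because (27) and (30) imply that the standardized CACE estimate equals the standardized ITT estimate divided by the estimated fraction of compliers. The sketch below uses the rounded values reported in the table, so small discrepancies in the third decimal place are expected; the variable names are hypothetical.

```python
# (label, estimated fraction of compliers, standardized ITT, standardized CACE), from Table 6.2
TABLE_6_2 = [
    ("21st Century, reading score", 0.753, -0.027, -0.036),
    ("21st Century, math course grade", 0.765, -0.062, -0.081),
    ("Education Technologies, grade 1 reading", 0.881, 0.014, 0.016),
    ("Education Technologies, grade 4 reading", 0.890, 0.017, 0.019),
    ("NYC Vouchers, reading score", 0.745, 0.032, 0.043),
    ("NYC Vouchers, math score", 0.745, 0.035, 0.047),
    ("Power4Kids, word attack score", 0.996, 0.377, 0.379),
    ("Power4Kids, GRADE score", 0.996, 0.221, 0.222),
    ("Teacher Induction, stay in district", 0.947, -0.030, -0.031),
    ("Teacher Induction, lesson implement score", 0.960, -0.015, -0.016),
    ("Early Head Start, Bayley MDI score", 0.919, 0.113, 0.123),
    ("Early Head Start, HOME score", 0.919, 0.081, 0.088),
    ("Job Corps, earnings", 0.717, 0.062, 0.086),
    ("Job Corps, arrests", 0.718, -0.095, -0.133),
]

# Largest gap between the reported CACE value and the ITT value divided by the complier share.
max_gap = max(abs(itt / share - cace) for _, share, itt, cace in TABLE_6_2)
print(max_gap)
```

The largest gap stays below 0.0015, which is within what rounding of the reported inputs can produce.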



Is Accounting for the Estimation Error in the Denominator of $\hat{\alpha}_{ITT\_E}$ Important?

The answer to this question is "no." We find strong evidence that accounting for the estimation error in
$S_y$ has a negligible effect on the standard error of $\hat{\alpha}_{ITT\_E}$ (Table 6.3). In our data, the correction
term raises the standard error of $\hat{\alpha}_{ITT\_E}$ by less than one-quarter of 1 percent for 18 out of 20 outcome
variables, and the correction never increases the standard error by more than 2 percent (Table 6.3;
Column 5). Similarly, the correction has a trivial effect on the statistical significance of $\hat{\alpha}_{ITT\_E}$. As
shown in the final column of Table 6.3, the correction changes the p-value of $\hat{\alpha}_{ITT\_E}$ only at the fourth
or higher decimal place.

The correction for the estimation error in $S_y$ would remain ignorable even if the ITT estimates were 0.25
standard deviations, which is larger than most of our observed $\hat{\alpha}_{ITT\_E}$ values (Table 6.4; Column 2).
As expected, when $\hat{\alpha}_{ITT\_E}$ is set to 0.25, the correction becomes more important than before, but the
correction still has a very small effect on the standard errors (less than a 2 percent increase in all but one
instance). Similarly, the p-value of $\hat{\alpha}_{ITT\_E}$ is hardly affected; the absolute increase in the p-value due
to the correction never exceeds 0.001 (Table 6.4; Column 3). In fact, if $\hat{\alpha}_{ITT\_E}$ were 0.25, the
t-statistic of the estimate would typically be so far out in the right tail of the distribution that the slight
decrease in the t-statistic from the correction would leave the p-value virtually unchanged.

Similarly, we find that on average across the considered RCTs, $\hat{\alpha}_{ITT\_E}$ would need to be about 0.8
standard deviations for the corrections to increase the standard error of $\hat{\alpha}_{ITT\_E}$ by 5 percent (last
column in Table 6.4). This is a large effect size in social policy evaluations, and is more than double the
largest ITT impact found in our studies.

Table 6.1: Dependent Variable Information and ITT Impact Estimates in Nominal and Effect Size Units,
by Study

                                Dependent Variable Information     ITT Impact Estimate^a
Study and Dependent             Measurement          Standard      Nominal      Standard
Variable                        Units                Deviation     Units        Deviation Units

21st Century
  Reading score                 Percentiles            25.57        -0.70        -0.027
  Math course grade             Percentage points       9.98        -0.62        -0.062
Teach for America
  Reading score                 NCE^b                  21.99         0.34         0.016
  Math score                    NCE^b                  18.56         2.35*        0.127*
Education Technologies
  Grade 1 reading score         NCE^b                  20.62         0.28         0.014
  Grade 4 reading score         NCE^b                  18.81         0.31         0.017
NYC Vouchers
  Reading score                 Percentiles            23.08         0.74         0.032
  Math score                    Percentiles            23.50         0.82         0.035
Power4Kids
  Word attack score             Standard points        10.77         4.06*        0.377*
  GRADE score                   Standard points        14.33         3.16*        0.221*
Teacher Induction
  Whether stay in district      Binary outcome          0.38        -0.01        -0.030
  Lesson implement score        Scale points            0.92        -0.01        -0.015
Early Head Start
  Bayley MDI score              Scale points           12.63         1.42*        0.113*
  HOME score                    Scale points            4.79         0.39*        0.081*
Job Corps
  Earnings                      Dollars per week      195.02        12.04*        0.062*
  Arrests                       Number                  1.44        -0.14*       -0.095*
Cash-Out
  Value of purchased food       Dollars per week       42.36        -6.38*       -0.151*
  Energy as percent of RDA      Percentage points      62.21        -7.41*       -0.119*
Teenage Parent
  Percent of months active      Percentage points      32.94         5.96*        0.181*
  Earnings                      Dollars per month     268.53        18.79         0.070

Source: Data from studies listed in Appendix Table B.1.

Note: Impact estimates are regression-adjusted using the covariates indicated in Appendix Table B.1.

^a ITT impacts are estimated by the authors. See Appendix Table B.1 for nominal ITT impact estimates from the
published study reports.

^b Denotes normal curve equivalents.

* The ITT impact estimate is significantly different from zero at the 0.05 level, two-tailed test using the uncorrected
standard error.

Table 6.2: Standardized ITT and CACE Impact Estimates, by Study

                                Estimated Fraction of        Standardized Impact Estimates
Study and Dependent             Individuals in the Study
Variable                        Who Are Compliers            ITT          CACE

21st Century
  Reading score                       0.753                 -0.027       -0.036
  Math course grade                   0.765                 -0.062       -0.081
Education Technologies
  Grade 1 reading score               0.881                  0.014        0.016
  Grade 4 reading score               0.890                  0.017        0.019
NYC Vouchers
  Reading score                       0.745                  0.032        0.043
  Math score                          0.745                  0.035        0.047
Power4Kids
  Word attack score                   0.996                  0.377*       0.379*
  GRADE score                         0.996                  0.221*       0.222*
Teacher Induction
  Whether stay in district            0.947                 -0.030       -0.031
  Lesson implement score              0.960                 -0.015       -0.016
Early Head Start
  Bayley MDI score                    0.919                  0.113*       0.123*
  HOME score                          0.919                  0.081*       0.088*
Job Corps
  Earnings                            0.717                  0.062*       0.086*
  Arrests                             0.718                 -0.095*      -0.133*

Source: Data from studies listed in Appendix Table B.1.

Note: Impact estimates and estimated fractions of individuals who are compliers are regression-adjusted using
the covariates indicated in Appendix Table B.1.

* The impact estimate is significantly different from zero at the 0.05 level, two-tailed test using the uncorrected
standard error.

Table 6.3: Uncorrected and Corrected Standard Errors of $\hat{\alpha}_{ITT\_E}$, by Study

                                                Standard Error of $\hat{\alpha}_{ITT\_E}$               Absolute Change in
                                                                           Percentage    the p-value of
                                $\hat{\alpha}_{ITT\_E}$   Uncorrected   Corrected    Change due    $\hat{\alpha}_{ITT\_E}$ due to
Study and Dependent                             Value         Value        to the        the Correction
Variable                        (x10^2)         (x10^2)       (x10^2)      Correction    (x10^4)

21st Century
  Reading score                   -2.7           4.142         4.142         0.01          0.280
  Math course grade               -6.2           4.873         4.874         0.03          1.179
Teach for America
  Reading score                    1.6           3.436         3.437         0.02          0.636
  Math score                      12.7           5.523*        5.534*        0.21          2.736
Education Technologies
  Grade 1 reading score            1.4           4.481         4.481         0.00          0.043
  Grade 4 reading score            1.7           3.797         3.798         0.01          0.200
NYC Vouchers
  Reading score                    3.2           5.138         5.138         0.01          0.363
  Math score                       3.5           5.328         5.329         0.01          0.404
Power4Kids
  Word attack score               37.7           8.020*        8.151*        1.63          0.011
  GRADE score                     22.1           9.571*        9.609*        0.40          5.134
Teacher Induction
  Whether stay in district        -3.0           7.100         7.100         0.01          0.154
  Lesson implement score          -1.5           8.547         8.547         0.00          0.018
Early Head Start
  Bayley MDI score                11.3           4.417*        4.421*        0.10          0.772
  HOME score                       8.1           4.009*        4.011*        0.06          1.187
Job Corps
  Earnings                         6.2           1.965*        1.965*        0.02          0.041
  Arrests                         -9.5           1.935*        1.936*        0.05          0.000
Cash-Out
  Value of purchased food        -15.1           6.066*        6.074*        0.13          1.146
  Energy as percent of RDA       -11.9           6.066*        6.072*        0.09          2.024
Teenage Parent
  Percent of months active        18.1           4.750*        4.761*        0.22          0.048
  Earnings                         7.0           4.815         4.817         0.03          1.332

Source: Data from studies listed in Appendix Table B.1.

Note: Impact estimates and standard errors are regression-adjusted using the covariates indicated in
Appendix Table B.1. The percentage change in the standard error due to the correction is equal to the
difference between the corrected and uncorrected standard error, divided by the uncorrected standard
error, and multiplied by 100.

* The standardized ITT estimate is significant at the 0.05 level, two-tailed test using the indicated standard error.

Table 6.4: Simulated Effects of Variance Corrections on the Standard Error of $\hat{\alpha}_{ITT\_E}$, for an
Assumed ITT Impact Value of 0.25 and by Study

                                Percentage Change        Absolute Change          Value of $\hat{\alpha}_{ITT\_E}$ so
                                in the Standard Error    in the p-value (x10^4)   that the Correction
                                of $\hat{\alpha}_{ITT\_E}$ due to the     of $\hat{\alpha}_{ITT\_E}$ due to the     Will Increase the
Study and Dependent             Correction if $\hat{\alpha}_{ITT\_E}$     Correction if $\hat{\alpha}_{ITT\_E}$     Standard Error
Variable                        Were Equal to 0.25       Were Equal to 0.25       by 5 Percent

21st Century
  Reading score                       0.55                    0.000                   0.762
  Math course grade                   0.42                    0.000                   0.872
Teach for America
  Reading score                       4.87                    0.000                   0.253
  Math score                          0.80                    0.011                   0.631
Education Technologies
  Grade 1 reading score               0.61                    0.000                   0.723
  Grade 4 reading score               1.41                    0.000                   0.475
NYC Vouchers
  Reading score                       0.54                    0.002                   0.770
  Math score                          0.50                    0.003                   0.802
Power4Kids
  Word attack score                   0.72                    1.430                   0.666
  GRADE score                         0.51                    3.536                   0.793
Teacher Induction
  Whether stay in district            0.36                    0.209                   0.944
  Lesson implement score              0.36                    1.182                   0.941
Early Head Start
  Bayley MDI score                    0.48                    0.000                   0.814
  HOME score                          0.54                    0.000                   0.771
Job Corps
  Earnings                            0.37                    0.000                   0.925
  Arrests                             0.37                    0.000                   0.924
Cash-Out
  Value of purchased food             0.35                    0.024                   0.961
  Energy as percent of RDA            0.39                    0.027                   0.905
Teenage Parent
  Percent of months active            0.43                    0.000                   0.867
  Earnings                            0.42                    0.000                   0.872

Source: Data from studies listed in Appendix Table B.1.

Note: Standard errors are regression-adjusted using the covariates indicated in Appendix Table B.1. The
percentage change in the standard error due to the correction is equal to the difference between the
corrected and uncorrected standard error, divided by the uncorrected standard error, and multiplied by
100.

Is Accounting for the Estimation Error in the Denominator of $\hat{\alpha}_{CACE\_E}$ Important?

The answer to this question is also "no." We find that the variance corrections exert a bit more influence
on the variance estimates for $\hat{\alpha}_{CACE\_E}$ than for $\hat{\alpha}_{ITT\_E}$, but the influence is still generally very
small; only in rare instances do these corrections change the variance estimates by more than 1 percent.

Our key finding is that the standard error of $\hat{\alpha}_{CACE\_E}$ does not rise noticeably when correction terms
involving $S_y$ and $\hat{p}_1$ are included in the variance calculations (Table 6.5). The corrections increase the
uncorrected standard errors by less than 0.5 percent for all studies except for the Word Attack Score in the
Power4Kids study, where the increase is 1.6 percent (Table 6.5; Column 5). The effect of the corrections
on p-values is negligible; the corrections never raise or lower the p-value by more than 0.001 (Table 6.5;
last column).

We also find that none of the individual correction terms in equation (31) is consistently important (Table
6.6). For 12 out of 14 outcome variables, every correction term is less than 0.5 percent of the uncorrected
variance value for $\hat{\alpha}_{CACE\_E}$. Interestingly, $AsyCov(\hat{\alpha}_{ITT}, \hat{p}_1)$ has no consistent sign. In some
instances, the variance reduction due to a negative covariance term offsets the positive variance
contributions of the other correction terms. This explains why in some cases the corrections reduce the
standard errors shown in Table 6.5 (as indicated by negative values in the fifth column of Table 6.5).

Simulations suggest that the results remain unchanged if $\hat{\alpha}_{CACE\_E}$ is set to 0.25 (Table 6.7; Columns 2
and 3). For this scenario, for all but one outcome, the correction terms raise the standard error of
$\hat{\alpha}_{CACE\_E}$ by less than 2 percent; the corresponding rise in the p-value never exceeds 0.001. Furthermore,
on average across the considered RCTs, the standardized CACE impact would need to be 0.7 standard
deviations for the corrections to raise the standard error of $\hat{\alpha}_{CACE\_E}$ by 5 percent (Table 6.7; Column 4).

Table 6.5: Uncorrected and Corrected Standard Errors of $\hat{\alpha}_{CACE\_E}$, by Study

                                                Standard Error of $\hat{\alpha}_{CACE\_E}$              Absolute Change in
                                                                           Percentage    the p-value of
                                $\hat{\alpha}_{CACE\_E}$  Uncorrected   Corrected    Change due    $\hat{\alpha}_{CACE\_E}$ due to
Study and Dependent                             Value         Value        to the        the Correction
Variable                        (x10^2)         (x10^2)       (x10^2)      Correction    (x10^4)

21st Century
  Reading score                   -3.6           5.498         5.499         0.01          0.247
  Math course grade               -8.1           6.368         6.378         0.16          7.173
Education Technologies
  Grade 1 reading score            1.6           5.084         5.081        -0.07         -1.647
  Grade 4 reading score            1.9           4.265         4.262        -0.07         -2.363
NYC Vouchers
  Reading score                    4.3           6.901         6.902         0.02          0.992
  Math score                       4.7           7.157         7.164         0.10          4.369
Power4Kids
  Word attack score               37.9           8.054*        8.186*        1.64          0.012
  GRADE score                     22.2           9.612*        9.653*        0.42          5.412
Teacher Induction
  Whether stay in district        -3.1           7.500         7.503         0.04          1.117
  Lesson implement score          -1.6           8.900         8.902         0.02          0.330
Early Head Start
  Bayley MDI score                12.3           4.806*        4.811*        0.10          0.808
  HOME score                       8.8           4.362*        4.359*       -0.06         -1.324
Job Corps
  Earnings                         8.6           2.740*        2.741*        0.03          0.055
  Arrests                        -13.3           2.694*        2.695*        0.04          0.000

Source: Data from studies listed in Appendix Table B.1.

Note: Impact estimates and standard errors are regression-adjusted using the covariates indicated in
Appendix Table B.1. The percentage change in the standard error due to the correction is equal to the
difference between the corrected and uncorrected standard error, divided by the uncorrected standard
error, and multiplied by 100.

* The standardized CAGE impact estimate is significantly different from zero at the 0.05 level, two-tailed test using 
the indicated standard error. 
Table 6.6: Components of Variance Corrections for $\hat{\alpha}_{CACE\_E}$, by Study

Percentage change in the estimated variance of $\hat{\alpha}_{CACE\_E}$ due to correction terms involving:
  (1) All corrections
  (2) Variance of the sample standard deviation of the dependent variable
  (3) Variance of the estimated fraction of individuals who are compliers
  (4) Covariance between the nominal ITT estimator and the fraction who are compliers

Study and Dependent Variable        (1)      (2)      (3)      (4)
21st Century
  Reading score                    0.01     0.01     0.02    -0.02
  Math course grade                0.32     0.05     0.07     0.19
Education Technologies
  Grade 1 reading score           -0.14     0.00     0.02    -0.16
  Grade 4 reading score           -0.15     0.01     0.04    -0.20
NYC Vouchers
  Reading score                    0.05     0.02     0.04     0.00
  Math score                       0.21     0.02     0.04     0.15
Power4Kids
  Word attack score                3.31     3.29     0.09    -0.07
  GRADE score                      0.84     0.79     0.02     0.02
Teacher Induction
  Whether stay in district         0.07     0.01     0.00     0.06
  Lesson implement score           0.05     0.00     0.00     0.04
Early Head Start
  Bayley MDI score                 0.21     0.20     0.07    -0.06
  HOME score                      -0.13     0.11     0.04    -0.28
Job Corps
  Earnings                         0.06     0.05     0.06    -0.05
  Arrests                          0.08     0.11     0.15    -0.18

Source: Data from studies listed in Appendix Table B.1.

Note: The indicated correction terms in the final three columns denote the second, third, and fourth terms,
respectively, on the right hand side of equation (31). The percentage change in the estimated variance
of the standardized CACE impact estimator due to the indicated correction term is equal to the
indicated correction term divided by the first term on the right hand side of equation (31), and
multiplied by 100.
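
The arithmetic in this note can be sketched in a few lines of Python. The function names are hypothetical and are not taken from the report. The sketch assumes only that a standard error is the square root of the corresponding variance, so a given percentage change in the variance implies a smaller percentage change in the standard error. For the two Power4Kids outcomes, the "all corrections" variance changes of 3.31 and 0.84 percent imply standard error changes that agree, up to rounding, with the 1.64 and 0.42 percent reported for the same outcomes in the preceding table.

```python
import math

def pct_change_variance(correction_term, first_term):
    """Percentage change in the variance attributable to one correction term,
    measured relative to the first (uncorrected) term of the variance formula."""
    return 100.0 * correction_term / first_term

def pct_change_se(pct_change_var_total):
    """Percentage change in the standard error implied by a total percentage
    change in the variance (SE is the square root of the variance)."""
    return 100.0 * (math.sqrt(1.0 + pct_change_var_total / 100.0) - 1.0)

# Total ("all corrections") variance changes for the two Power4Kids outcomes.
for label, total_pct in [("Word attack score", 3.31), ("GRADE score", 0.84)]:
    print(label, round(pct_change_se(total_pct), 2))
```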
Table 6.7: Simulated Effects of Variance Corrections on the Standard Error of $\hat{\alpha}_{CACE\_E}$ for an
Assumed CACE Impact Value of 0.25, by Study

Columns:
  (1) Percentage change in the standard error of $\hat{\alpha}_{CACE\_E}$ due to the correction if
      $\alpha_{CACE\_E}$ were equal to 0.25
  (2) Absolute change in the p-value ($\times 10^{-4}$) of $\hat{\alpha}_{CACE\_E}$ due to the correction if
      $\alpha_{CACE\_E}$ were equal to 0.25
  (3) Value of $\alpha_{CACE\_E}$ so that the correction will increase the standard error by 5 percent

Study and Dependent Variable        (1)       (2)       (3)
21st Century
  Reading score                    0.86     0.011     0.626
  Math course grade                0.30     0.043     0.800
Education Technologies
  Grade 1 reading score            1.46     0.004     0.404
  Grade 4 reading score            3.33     0.000     0.298
NYC Vouchers
  Reading score                    0.88     0.376     0.601
  Math score                       1.22     0.812     0.564
Power4Kids
  Word attack score                0.71     1.463     0.663
  GRADE score                      0.53     3.778     0.783
Teacher Induction
  Whether stay in district         0.21     0.221     0.913
  Lesson implement score           0.07     0.285     0.980
Early Head Start
  Bayley MDI score                 0.49     0.000     0.779
  HOME score                       0.22     0.000     0.807
Job Corps
  Earnings                         0.39     0.000     0.854
  Arrests                          0.63     0.000     0.787

Source: Data from studies listed in Appendix Table B.1.

Note: Standard errors are regression-adjusted using the covariates indicated in Appendix Table B.1. The
percentage change in the standard error due to the correction is equal to the difference between the
corrected and uncorrected standard error, divided by the uncorrected standard error, and multiplied by
100.
Chapter 7: Summary and Conclusions 

This report has examined the identification and estimation of the CACE parameter for two-level clustered 
RCTs that are commonly used in education research, where groups (such as schools or classrooms) rather 
than students are the unit of random assignment. We generalized the causal inference and IV framework 
developed by Angrist et al. (1996) to derive conditions for identifying the CACE parameter under 
clustered designs where multi-level treatment compliance decisions can be made by both school staff and 
students. 

This report also provides simple asymptotic variance estimation formulas for CACE impact estimators 
measured in both nominal and standard deviation units. Because these IV impact estimators are ratio 
estimators, the variance formulas account for the estimation error in both the numerators (which pertain to 
the nominal ITT impact estimates) and the denominators (which pertain to the estimated service receipt 
rates and the estimated standard deviations of the outcomes). 
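
To illustrate how denominator error enters the variance of a ratio estimator, the sketch below applies the generic first-order delta-method approximation for a ratio of two estimators. It is not the report's own variance formula, which also accounts for the estimated standard deviation of the outcome. The function name and all input values are hypothetical and chosen only for illustration.

```python
import math

def ratio_variance(num, den, var_num, var_den, cov=0.0):
    """First-order delta-method variance of the ratio num/den.

    Returns (uncorrected, corrected): the uncorrected variance treats the
    denominator as known, and the corrected variance adds the terms for
    the denominator's variance and its covariance with the numerator.
    """
    uncorrected = var_num / den**2
    correction = (num**2 * var_den) / den**4 - (2.0 * num * cov) / den**3
    return uncorrected, uncorrected + correction

# Hypothetical inputs: a nominal ITT estimate and an estimated complier fraction.
itt, var_itt = 5.0, 4.0
p_c, var_p_c = 0.8, 0.0004

v0, v1 = ratio_variance(itt, p_c, var_itt, var_p_c)
pct = 100.0 * (math.sqrt(v1) - math.sqrt(v0)) / math.sqrt(v0)
print(f"uncorrected SE = {math.sqrt(v0):.3f}, "
      f"corrected SE = {math.sqrt(v1):.3f}, change = {pct:.2f}%")
```

With inputs of this kind, where the numerator variance is large relative to the denominator variance, the correction changes the standard error only slightly.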

Researchers sometimes assume that the denominator terms in these ratio estimators are known, and thus 
present the same p-values from significance tests for all ITT and CACE impact estimates. This approach, 
however, could yield incorrect significance findings if the variance components due to the denominator 
terms matter. Accordingly, we used data from 10 large-scale RCTs in education and other social policy 
areas to compare significance findings for the considered impact estimates using uncorrected and 
corrected variance estimators. 

Our key empirical finding is that the variance correction terms have very little effect on the standard 
errors of the standardized ITT and CACE impact estimators. Across the examined outcomes, the 
correction terms typically raise the standard errors by less than 1 percent, and change p-values at the 
fourth or higher decimal place. Furthermore, simulations indicate that, on average, the impact estimates 
would need to be 0.7 to 0.8 standard deviations, representing effect sizes that are rarely found in practice, 
before the variance corrections would raise the standard errors by 5 percent. These results occur because, 
by far, the most important source of variance in the considered ratio estimators is the variance of the 
nominal ITT impact estimators. 

Despite these results, we advocate, for rigor, that education researchers use the correct standard error 
formulas for standardized ITT and CACE impact estimates. The formulas laid out in this report are 
relatively straightforward to apply, and their use will protect against the risk of reaching incorrect 
significance findings, even if this risk is likely to be low based on our empirical findings. 
Appendix A: Proof of Equation (29) 

Following Hedges (2007), note that $n(m-1)S_W^2/\sigma_W^2$ has an approximate chi-squared distribution with 
$n(m-1)$ degrees of freedom. Thus, $AsyVar(S_W^2) = 2\sigma_W^4/[n(m-1)]$. Similarly, $(n-2)S_B^2/E(S_B^2)$ has an 
approximate chi-squared distribution with $(n-2)$ degrees of freedom. Thus, 

$$AsyVar(S_B^2) = \frac{2(\sigma_B^2 + \sigma_W^2/m)^2}{n-2}.$$

Using (26), we find then that: 

$$AsyVar(S_T^2) = \frac{2(\sigma_B^2 + \{\sigma_W^2/m\})^2}{n-2} + \left[\frac{m-1}{m}\right]^2 \frac{2\sigma_W^4}{n(m-1)}. \tag{A.1}$$

To obtain a variance expression for $S_T$ in terms of the variance expression for $S_T^2$ in (A.1), we apply a 
Taylor series expansion of $S_T$ around $\sigma_T$: 

$$S_T \approx \sigma_T + \frac{S_T^2 - \sigma_T^2}{2\sigma_T}. \tag{A.2}$$

Because $S_T^2$ is asymptotically normal, the delta method implies that $S_T$ is asymptotically normal with the 
following asymptotic variance: 

$$AsyVar(S_T) = \frac{AsyVar(S_T^2)}{4\sigma_T^2}. \tag{A.3}$$

After some algebra, (29) follows after inserting (A.1) into (A.3) and replacing $\sigma_B^2$, $\sigma_W^2$, and $\sigma_T$ by their 
estimators. 
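
The two steps of this proof can be sketched numerically. The sketch below is illustrative only: the function names and input values are hypothetical, and it assumes that the total variance equals the sum of the between-cluster and within-cluster variances. The first function evaluates the asymptotic variance of the total sample variance as in (A.1), and the second applies the delta-method step in (A.3) to obtain the asymptotic variance of the sample standard deviation.

```python
def asy_var_total_variance(sigma_b2, sigma_w2, n, m):
    """Asymptotic variance of the total sample variance, as in (A.1).

    sigma_b2: between-cluster variance; sigma_w2: within-cluster variance;
    n: number of clusters; m: number of individuals per cluster.
    """
    between_part = 2.0 * (sigma_b2 + sigma_w2 / m) ** 2 / (n - 2)
    within_part = ((m - 1) / m) ** 2 * 2.0 * sigma_w2 ** 2 / (n * (m - 1))
    return between_part + within_part

def asy_var_total_sd(sigma_b2, sigma_w2, n, m):
    """Delta-method step in (A.3): divide by four times the total variance.

    Assumption: total variance = between-cluster + within-cluster variance.
    """
    sigma_t2 = sigma_b2 + sigma_w2
    return asy_var_total_variance(sigma_b2, sigma_w2, n, m) / (4.0 * sigma_t2)

# Hypothetical design: 40 clusters of 25 individuals, intraclass correlation 0.2.
print(asy_var_total_variance(0.2, 0.8, n=40, m=25))
print(asy_var_total_sd(0.2, 0.8, n=40, m=25))
```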
Appendix B: 

Table B.1: Summary of Data Sources 

Each entry lists, in order, the table's six columns: Study (Authors; Sponsor)^a; Description of Program or
Intervention and Study Design; Original Study Population and Number of Treatment and Control Observations
Used in Current Analyses^b; Level of Clustering (Intraclass Correlation Coefficient)^c; Outcome Measures
(Corresponding Estimate of Nominal ITT Impact from Published Study Report); Baseline Covariates.

Study: Evaluation of the 21st Century Community Learning Centers Program (James-Burdumy et al. 2005; IES)
  Description and design: Study examined the effects of participation in after-school programs on academic
    and behavioral outcomes of elementary school students in 12 school districts and 26 centers. Students
    interested in attending after-school programs were randomly assigned to the treatment or control groups
    within each after-school program center.
  Population and observations: Students in kindergarten to 6th grade in the 2000-2001 school year within 12
    unspecified school districts. 857; 796
  Level of clustering (ICC): None
  Outcome measures (nominal ITT impact): Stanford-9 reading score in second year of study (0.3); math course
    grade in second year (-0.6)
  Baseline covariates: Baseline test scores in reading and math; grade level; whether student is overage for
    grade; race/ethnicity; number of absences, tardies, and suspensions in year prior to study; whether
    student has been retained in any prior year; site indicators

Study: Teach for America Evaluation (Decker et al. 2004; SRF, HF, CC)
  Description and design: Study examined the impact of teachers from Teach for America, a highly selective
    alternative certification program, on the academic achievement of elementary school students. Students
    were randomly assigned to classrooms taught by Teach for America teachers or traditional teachers in the
    same grade and school.
  Population and observations: 1st to 5th graders in the 2001-2002 school year; 17 schools in Baltimore,
    Chicago, Los Angeles, Mississippi Delta, and New Orleans. 742; 911
  Level of clustering (ICC): Teacher (0.561)
  Outcome measures (nominal ITT impact): Iowa Test of Basic Skills (ITBS) reading score (0.56); ITBS math
    score (2.43)
  Baseline covariates: Baseline test scores in reading and math; grade level; school indicators

Study: Evaluation of Reading and Mathematics Education Technologies (Dynarski et al. 2007; IES)
  Description and design: Study examined the effects of 16 software products on students' academic
    achievement in 1st grade reading, 4th grade reading, 6th grade math, and algebra in 33 school districts.
    Within each participating school, teachers were randomly assigned to use a study product or not. For the
    purposes of our report, outcomes in 1st and 4th grades are used.
  Population and observations: Students in 1st grade, 4th grade, 6th grade, and algebra classes in the
    2004-05 school year in 33 districts. 1,160; 777
  Level of clustering (ICC): Teacher (0.197)
  Outcome measures (nominal ITT impact): 1st grade Stanford-9 reading score (0.73); 4th grade Stanford-10
    reading score (0.41)
  Baseline covariates: Baseline test scores; student's age and gender; teacher's gender, experience, and
    highest degree; school's racial/ethnic composition; percent of school's students eligible for special
    education and subsidized lunch

Study: New York City School Voucher Experiment (Mayer et al. 2002; SCSF)
  Description and design: Study examined the effects of three-year private school scholarship offers on the
    academic outcomes of children from low-income families. Eligible families who applied for scholarships
    for their children were randomly selected for scholarships in a series of lotteries.
  Population and observations: Low-income children enrolled in kindergarten to fourth grade in 1997 in New
    York City public schools. 672; 471
  Level of clustering (ICC): Family (0.436)
  Outcome measures (nominal ITT impact): Iowa Test of Basic Skills (ITBS) reading score in third year of
    study (0.27); ITBS math score in third year (1.59)
  Baseline covariates: Baseline test scores in reading and math; randomization strata indicators

Study: Power4Kids Study (Torgesen et al. 2006; IES)
  Description and design: Study examined the impact of four widely used remedial reading instructional
    programs on students' reading skills. 50 schools from 27 districts were randomly assigned to one of the
    interventions, and within each school eligible children who were identified as struggling readers were
    randomly assigned to receive the intervention or not. For the purposes of our report, the treatment
    group consists of students assigned to receive any of the four interventions, and third graders'
    outcomes are used.
  Population and observations: 3rd and 5th grade students in the 2003-04 school year within 27 school
    districts near Pittsburgh, PA who are identified as struggling readers. 211; 127
  Level of clustering (ICC): None
  Outcome measures (nominal ITT impact): Woodcock Reading Mastery Test-Revised Word Attack subtest score
    (5.0); Group Reading Assessment and Diagnostic Evaluation (GRADE) Passage Comprehension subtest
    score (4.6)
  Baseline covariates: School indicators; baseline test scores

Study: Evaluation of Comprehensive Teacher Induction Programs (Glazerman et al. 2008; IES)
  Description and design: Study examined the effects of comprehensive teacher induction programs on teacher
    retention, teachers' classroom practices, and student outcomes. The comprehensive programs provide
    beginning teachers with an orientation, mentoring sessions, and professional development. Within 17
    participating districts, elementary schools were randomly assigned to participate in comprehensive
    induction programs or to use their district's existing induction program.
  Population and observations: Beginning teachers in elementary schools within 17 low-income school
    districts across 13 states in the 2005-06 school year. 457; 425
  Level of clustering (ICC): School (0.125)
  Outcome measures (nominal ITT impact): Binary variable indicating that the teacher stayed in the same
    district from the first year of the study to the start of the second year (0.002); teacher's score, as
    assigned by trained observer using the Vermont Classroom Observation Tool, for implementation of
    literacy lessons (0.0)
  Baseline covariates: Grade level taught; teacher's age, gender, race/ethnicity, marital status, household
    structure, teaching experience, non-teaching experience, certification status, preparation type,
    educational attainment, college quality, residential location, and homeownership status; school's
    racial/ethnic and socioeconomic composition; district indicators

Study: Evaluation of Early Head Start (Love et al. 2002; ACF)
  Description and design: Study examined the impacts of Early Head Start, which provides center-based or
    home-based services to families with children aged 0 to 3, on child development and parenting outcomes.
    Within each of 17 participating programs, eligible applicants were randomly assigned to receive Early
    Head Start services or not.
  Population and observations: Low-income families with infants and toddlers aged 0 to 3 or pregnant women
    who applied to a study site in 1996. 879; 780
  Level of clustering (ICC): None
  Outcome measures (nominal ITT impact): Bayley Mental Development Index (MDI) score (1.456); Home
    Observation for Measurement of the Environment (HOME) total score (0.455)^d
  Baseline covariates: Mother's age, race/ethnicity, English ability, education, employment status, living
    arrangements, number of children, household income, welfare receipt, resource adequacy, mobility, and
    random assignment date; child's age, birth weight status, premature birth status, gender, and previous
    Head Start enrollment; site indicators

Study: National Job Corps Study (Schochet et al. 2001; DOL)
  Description and design: Study examined the impacts of Job Corps, a large federal program providing
    educational and vocational training services to disadvantaged youth aged 16-24 in a residential
    setting, on employment and related outcomes. Among youths applying to Job Corps in a thirteen-month
    period, a subset of applicants was randomly offered enrollment in Job Corps.
  Population and observations: Disadvantaged youth between the ages of 16 and 24 who applied to Job Corps in
    1995 and were determined eligible. 6,518; 4,298
  Level of clustering (ICC): None
  Outcome measures (nominal ITT impact): Weekly earnings in the fourth year of the study (15.9); total
    number of arrests during the four years of the study (-0.09)
  Baseline covariates: None

Study: San Diego Food Stamp Cash-Out Experiment (Ohls et al. 1992; FNS)
  Description and design: Study examined the effects of cashing-out food stamps on food-purchasing and
    food-use patterns of Food Stamp Program (FSP) participants in San Diego County. FSP households were
    randomly assigned to the cash-out status or the regular coupon status.
  Population and observations: FSP recipients in 1989 in San Diego County. 613; 613
  Level of clustering (ICC): None
  Outcome measures (nominal ITT impact): Money value of purchased food used at home by the household in the
    last seven days (-5.17); average availability of food energy per equivalent nutrition unit as a
    percentage of the recommended daily allowance (RDA) (-6.42)
  Baseline covariates: None

Study: Teenage Parent Demonstration (Maynard et al. 1993; ACF)
  Description and design: Study examined the impacts of a demonstration program in 3 cities for teenage
    welfare mothers. The program required that welfare mothers participate in employment, job training, or
    education activities in order to receive full welfare benefits and also provided child care and
    transportation assistance. All first-time teenage welfare mothers were randomly assigned to be subject
    to the enhanced set of requirements and services or not. For the purposes of our report, outcomes from
    Chicago are used.
  Population and observations: Teenage mothers who applied for AFDC for the first time in 1987 in Camden NJ,
    Newark NJ, and Chicago, IL. 805; 822
  Level of clustering (ICC): None
  Outcome measures (nominal ITT impact): Percentage of months active in employment, job training, or
    education activities (6.1); average monthly earnings (24.4)
  Baseline covariates: Teenage parent's race/ethnicity, living arrangements, health barriers to work,
    English proficiency, contact with father of child, diploma status, school enrollment status, math
    skills, prior work experience, and date of study entry; teen's residency with own father and residency
    in welfare household as child; education of teen's mother; age of teen's child

^a Acronyms are defined as follows: IES = Institute of Education Sciences at the U.S. Department of Education; SRF = 
Smith Richardson Foundation; HF = Hewlett Foundation; CC = Carnegie Corporation; SCSF = School Choice 
Scholarships Foundation; ACF = Administration for Children and Families at the U.S. Department of Health and Human 
Services; DOL = U.S. Department of Labor; FNS = Food and Nutrition Service of the U.S. Department of Agriculture. 

^b For each study, the size of the treatment and control group pertains to the sample used for estimating impacts on the first 
listed outcome variable. 

^c For each study, the intraclass correlation coefficient pertains to the first listed outcome variable and is estimated by the 
authors. 

^d The published report for the evaluation of Early Head Start reported nominal CACE impacts. Nominal ITT impacts are 
calculated as the published nominal CACE impacts multiplied by 0.91, the reported fraction of individuals in the study 
who are compliers. 
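Table B.2: Information on the Receipt of Intervention Services, by Study 

Each entry lists: the definition of service receipt used in the current analysis; whether it is the same as
the definition used in the published study; and the estimated fraction of individuals in the treatment
group and in the control group who received intervention services.

Study: 21st Century
  Definition of service receipt: Student attended a program center for at least 1 day in the 2-year study
    period
  Same as definition used in published study? Yes
  Estimated fraction receiving services: treatment group 0.926; control group 0.177

Study: Education Technologies
  Definition of service receipt: Average student time using product in teacher's classroom was above 25% of
    the treatment group mean
  Same as definition used in published study? No CACE analyses in published study
  Estimated fraction receiving services: treatment group 0.885; control group 0.000

Study: NYC Vouchers
  Definition of service receipt: Child attended private school in any year of study
  Same as definition used in published study? Yes
  Estimated fraction receiving services: treatment group 0.854; control group 0.118

Study: Power4Kids
  Definition of service receipt: Student received positive hours of intervention
  Same as definition used in published study? Yes
  Estimated fraction receiving services: treatment group 0.991; control group 0.000

Study: Teacher Induction
  Definition of service receipt: Teacher was assigned a mentor in a comprehensive induction program
  Same as definition used in published study? No CACE analyses in published study
  Estimated fraction receiving services: treatment group 0.950; control group 0.003

Study: Early Head Start
  Definition of service receipt: Family received at least a minimal set of Early Head Start services
  Same as definition used in published study? Yes
  Estimated fraction receiving services: treatment group 0.916; control group 0.000

Study: Job Corps
  Definition of service receipt: Individual was ever enrolled in a Job Corps center in first three years of
    study
  Same as definition used in published study? Yes
  Estimated fraction receiving services: treatment group 0.729; control group 0.012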

Source: Data from studies listed in Appendix Table B.1.

Note: Estimated fractions of individuals who receive intervention services are averages of unit-level rates of 
service receipt in the relevant treatment status group and are not adjusted for covariates. For each 
indicated study, the estimation sample is the same as that used for estimating the CACE impact on the 
first outcome variable listed in Appendix Table B.1. 
Table B.3: Analytical and Bootstrap P-Values of $\hat{\alpha}_{ITT\_E}$ and $\hat{\alpha}_{CACE\_E}$, by Study 

                                 P-value of $\hat{\alpha}_{ITT\_E}$          P-value of $\hat{\alpha}_{CACE\_E}$
Study and Dependent Variable     Analytical  Bootstrap  Difference    Analytical  Bootstrap  Difference
21st Century
  Reading score                     0.509      0.513      -0.004         0.509      0.513      -0.004
  Math course grade                 0.201      0.198       0.003         0.202      0.196       0.006
Teach for America
  Reading score                     0.650      0.661      -0.012
  Math score                        0.022      0.032      -0.010
Education Technologies
  Grade 1 reading score             0.759      0.777      -0.018         0.759      0.779      -0.020
  Grade 4 reading score             0.661      0.651       0.010         0.661      0.649       0.012
NYC Vouchers
  Reading score                     0.533      0.532       0.002         0.533      0.530       0.003
  Math score                        0.515      0.511       0.004         0.515      0.512       0.003
Power4Kids
  Word attack score                 0.000      0.000       0.000         0.000      0.000       0.000
  GRADE score                       0.022      0.024      -0.003         0.022      0.025      -0.003
Teacher Induction
  Whether stay in district          0.676      0.685      -0.009         0.676      0.684      -0.008
  Lesson implement score            0.861      0.862      -0.001         0.861      0.861       0.000
Early Head Start
  Bayley MDI score                  0.011      0.009       0.001         0.011      0.009       0.001
  HOME score                        0.044      0.039       0.005         0.044      0.039       0.004
Job Corps
  Earnings                          0.002      0.004      -0.002         0.002      0.004      -0.002
  Arrests                           0.000      0.000       0.000         0.000      0.000       0.000
Cash-Out
  Value of purchased food           0.013      0.013       0.000
  Energy as percent of RDA          0.050      0.047       0.003
Teenage Parent
  Percent of months active          0.000      0.000       0.000
  Earnings                          0.146      0.141       0.005

Source: Data from studies listed in Appendix Table B.1.

Note: The analytical p-values of the standardized ITT [or CACE] impact estimate were obtained using (28) [or 
(31)]. The bootstrap p-values were obtained using the following steps: (1) Obtain the analytical t-statistic 
of the standardized impact estimate, as described previously; (2) Draw a stratified simulated sample of np 
treatment units and n(1-p) control units by sampling with replacement; (3) Use the simulated sample to 
calculate the simulated t-statistic, which is equal to the difference between the simulated and analytical 
standardized impact estimate, divided by the simulated value of the square root of (28) [or (31)]; (4) 
Repeat steps (2) and (3) an additional 4,999 times to obtain a total of 5,000 simulated t-statistics; (5) 
Calculate the proportion of simulated samples for which the absolute value of the simulated t-statistic 
exceeds the absolute value of the analytical t-statistic. 
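
The five bootstrap steps in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the estimator used here is a simple difference in means with its conventional standard error, standing in for the standardized impact estimators and the variance formulas (28) and (31), and the function names are hypothetical. In a clustered design, the resampled "units" would be the units of random assignment rather than individuals.

```python
import math
import random
import statistics

def diff_in_means(treat, ctrl):
    """Stand-in estimator (assumption): difference in means and its standard error."""
    est = statistics.fmean(treat) - statistics.fmean(ctrl)
    se = math.sqrt(statistics.variance(treat) / len(treat)
                   + statistics.variance(ctrl) / len(ctrl))
    return est, se

def bootstrap_t_pvalue(treat, ctrl, estimator=diff_in_means, reps=5000, seed=0):
    """Bootstrap p-value following steps (1)-(5) of the note."""
    rng = random.Random(seed)
    est, se = estimator(treat, ctrl)          # step (1): analytical t-statistic
    t_analytical = est / se
    exceed = 0
    for _ in range(reps):                     # step (4): repeat steps (2) and (3)
        # Step (2): resample with replacement within each treatment status group.
        sim_treat = rng.choices(treat, k=len(treat))
        sim_ctrl = rng.choices(ctrl, k=len(ctrl))
        # Step (3): center the simulated estimate at the analytical estimate.
        sim_est, sim_se = estimator(sim_treat, sim_ctrl)
        sim_t = (sim_est - est) / sim_se
        if abs(sim_t) > abs(t_analytical):
            exceed += 1
    return exceed / reps                      # step (5): proportion exceeding

if __name__ == "__main__":
    gen = random.Random(1)
    treat = [gen.gauss(0.3, 1.0) for _ in range(100)]  # synthetic data
    ctrl = [gen.gauss(0.0, 1.0) for _ in range(100)]
    print(bootstrap_t_pvalue(treat, ctrl))
```

Passing a different `estimator` callable that returns an estimate and its standard error would let the same loop be reused for other impact estimators.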